Urysohn width of hypersurfaces and positive macroscopic scalar curvature

Abstract: We prove that if a complete Riemannian $n$-manifold with non-trivial codimension 1 homology with $\mathbb{Z}_2$-coefficients or $\mathbb{Z}$-coefficients has positive macroscopic scalar curvature large enough, then it contains a non-nullhomologous hypersurface of small Urysohn $(n-2)$-width. This constitutes a macroscopic analogue of a theorem by Bray--Brendle--Neves on the area of non-contractible 2-spheres in a closed Riemannian 3-manifold with positive scalar curvature. Our proof is based on an adaptation of Guth's macroscopic version of the Schoen-Yau descent argument.

Submitted 9 April, 2025; originally announced April 2025.

Comments: 12 pages, 3 figures

MSC Class: Primary 53C23; Secondary 53C21

arXiv:2410.08855

MATCH: Model-Aware TVM-based Compilation for Heterogeneous Edge Devices

Authors: Mohamed Amine Hamdi, Francesco Daghero, Giuseppe Maria Sarda, Josse Van Delm, Arne Symons, Luca Benini, Marian Verhelst, Daniele Jahier Pagliari, Alessio Burrello

Abstract: Streamlining the deployment of Deep Neural Networks (DNNs) on heterogeneous edge platforms, coupling within the same micro-controller unit (MCU) instruction processors and hardware accelerators for tensor computations, is becoming one of the crucial challenges of the TinyML field. The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting to a different heterogeneous MCU family implies labor-intensive re-development of almost the entire compiler. On the opposite side, retargetable toolchains, such as TVM, fail to exploit the capabilities of custom accelerators, resulting in the generation of general but unoptimized code. To overcome this duality, we introduce MATCH, a novel TVM-based DNN deployment framework designed for easy agile retargeting across different MCU processors and accelerators, thanks to a customizable model-based hardware abstraction. We show that a general and retargetable mapping framework enhanced with hardware cost models can compete with and even outperform custom toolchains on diverse targets while only needing the definition of an abstract hardware model and a SoC-specific API. We tested MATCH on two state-of-the-art heterogeneous MCUs, GAP9 and DIANA. On the four DNN models of the MLPerf Tiny suite MATCH reduces inference latency by up to 60.88 times on DIANA, compared to using the plain TVM, thanks to the exploitation of the on-board HW accelerator. Compared to HTVM, a fully customized toolchain for DIANA, we still reduce the latency by 16.94%. On GAP9, using the same benchmarks, we improve the latency by 2.15 times compared to the dedicated DORY compiler, thanks to our heterogeneous DNN mapping approach that synergically exploits the DNN accelerator and the eight-cores cluster available on board.

Submitted 11 October, 2024; originally announced October 2024.

Comments: 13 pages, 11 figures, 4 tables

ACM Class: I.2.2; D.1.3

arXiv:2407.11999

doi 10.1109/IISWC59245.2023.00017

Optimising GPGPU Execution Through Runtime Micro-Architecture Parameter Analysis

Authors: Giuseppe M. Sarda, Nimish Shah, Debjyoti Bhattacharjee, Peter Debacker, Marian Verhelst

Abstract: GPGPU execution analysis has always been tied to closed-source, proprietary benchmarking tools that provide high-level, non-exhaustive, and/or statistical information, preventing a thorough understanding of bottlenecks and optimization possibilities. Open-source hardware platforms offer opportunities to overcome such limits and co-optimize the full {hardware-mapping-algorithm} compute stack. Yet, so far, this has remained under-explored. In this work, we exploit micro-architecture parameter analysis to develop a hardware-aware, runtime mapping technique for OpenCL kernels on the open Vortex RISC-V GPGPU. Our method is based on trace observations and targets optimal hardware resource utilization to achieve superior performance and flexibility compared to hardware-agnostic mapping approaches. The technique was validated on different architectural GPU configurations across several OpenCL kernels. Overall, our approach significantly enhances the performance of the open-source Vortex GPGPU, contributing to unlocking its potential and usability.

Submitted 14 June, 2024; originally announced July 2024.

Journal ref: 2023 IEEE International Symposium on Workload Characterization (IISWC)

arXiv:2407.07198

Complete 3-manifolds of positive scalar curvature with quadratic decay

Authors: Florent Balacheff, Teo Gil Moreno de Mora Sardà, Stéphane Sabourau

Abstract: We prove that if an orientable 3-manifold $M$ admits a complete Riemannian metric whose scalar curvature is positive and has a subquadratic decay at infinity, then it decomposes as a (possibly infinite) connected sum of spherical manifolds and $\mathbb{S}^2 \times \mathbb{S}^1$ summands. This generalises a theorem of Gromov and Wang by using a different, more topological, approach. As a result, the manifold $M$ carries a complete Riemannian metric of uniformly positive scalar curvature, which partially answers a conjecture of Gromov. More generally, the topological decomposition holds without any scalar curvature assumption under a weaker condition on the filling discs of closed curves in the universal cover based on the notion of fill radius. Moreover, the decay rate of the scalar curvature is optimal in this decomposition theorem. Indeed, the manifold $\mathbb{R}^2 \times \mathbb{S}^1$ supports a complete metric of positive scalar curvature with exactly quadratic decay, but does not admit a decomposition as a connected sum.

Submitted 11 May, 2025; v1 submitted 9 July, 2024; originally announced July 2024.

Comments: 24 pages, 8 figures. To appear in Mathematische Annalen

MSC Class: Primary 53C23; Secondary 53C21

arXiv:2406.07453

doi 10.1109/DAC56929.2023.10247664

HTVM: Efficient Neural Network Deployment On Heterogeneous TinyML Platforms

Authors: Josse Van Delm, Maarten Vandersteegen, Alessio Burrello, Giuseppe Maria Sarda, Francesco Conti, Daniele Jahier Pagliari, Luca Benini, Marian Verhelst

Abstract: Optimal deployment of deep neural networks (DNNs) on state-of-the-art Systems-on-Chips (SoCs) is crucial for tiny machine learning (TinyML) at the edge. The complexity of these SoCs makes deployment non-trivial, as they typically contain multiple heterogeneous compute cores with limited, programmer-managed memory to optimize latency and energy efficiency. We propose HTVM - a compiler that merges TVM with DORY to maximize the utilization of heterogeneous accelerators and minimize data movements. HTVM allows deploying the MLPerf(TM) Tiny suite on DIANA, an SoC with a RISC-V CPU, and digital and analog compute-in-memory AI accelerators, at 120x improved performance over plain TVM deployment.

Submitted 11 June, 2024; originally announced June 2024.

Comments: Presented at DAC2023. Open-source code is available at https://github.com/KULeuven-MICAS/htvm

ACM Class: D.3.4

Journal ref: 2023 60th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 2023, pp. 1-6

arXiv:2306.05060

Precision-aware Latency and Energy Balancing on Multi-Accelerator Platforms for DNN Inference

Authors: Matteo Risso, Alessio Burrello, Giuseppe Maria Sarda, Luca Benini, Enrico Macii, Massimo Poncino, Marian Verhelst, Daniele Jahier Pagliari

Abstract: The need to execute Deep Neural Networks (DNNs) at low latency and low power at the edge has spurred the development of new heterogeneous Systems-on-Chips (SoCs) encapsulating a diverse set of hardware accelerators. How to optimally map a DNN onto such multi-accelerator systems is an open problem. We propose ODiMO, a hardware-aware tool that performs a fine-grain mapping across different accelerators on-chip, splitting individual layers and executing them in parallel, to reduce inference energy consumption or latency, while taking into account each accelerator's quantization precision to maintain accuracy. Pareto-optimal networks in the accuracy vs. energy or latency space are pursued for three popular dataset/DNN pairs, and deployed on the DIANA heterogeneous ultra-low power edge AI SoC. We show that ODiMO reduces energy/latency by up to 33%/31% with limited accuracy drop (-0.53%/-0.32%) compared to manual heuristic mappings.

Submitted 8 June, 2023; originally announced June 2023.

Comments: Accepted at 2023 ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED)

arXiv:2208.00331

CoNLoCNN: Exploiting Correlation and Non-Uniform Quantization for Energy-Efficient Low-precision Deep Convolutional Neural Networks

Authors: Muhammad Abdullah Hanif, Giuseppe Maria Sarda, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique

Abstract: In today's era of smart cyber-physical systems, Deep Neural Networks (DNNs) have become ubiquitous due to their state-of-the-art performance in complex real-world applications. The high computational complexity of these networks, which translates to increased energy consumption, is the foremost obstacle towards deploying large DNNs in resource-constrained systems. Fixed-Point (FP) implementations achieved through post-training quantization are commonly used to curtail the energy consumption of these networks. However, the uniform quantization intervals in FP restrict the bit-width of data structures to large values due to the need to represent most of the numbers with sufficient resolution and avoid high quantization errors. In this paper, we leverage the key insight that (in most of the scenarios) DNN weights and activations are mostly concentrated near zero and only a few of them have large magnitudes. We propose CoNLoCNN, a framework to enable energy-efficient low-precision deep convolutional neural network inference by exploiting: (1) non-uniform quantization of weights enabling simplification of complex multiplication operations; and (2) correlation between activation values enabling partial compensation of quantization errors at low cost without any run-time overheads. To significantly benefit from non-uniform quantization, we also propose a novel data representation format, Encoded Low-Precision Binary Signed Digit, to compress the bit-width of weights while ensuring direct use of the encoded weight for processing using a novel multiply-and-accumulate (MAC) unit design.

Submitted 30 July, 2022; originally announced August 2022.

Comments: 8 pages, 15 figures, 2 tables

Showing 1–7 of 7 results for author: Sarda, G M