-
Urysohn width of hypersurfaces and positive macroscopic scalar curvature
Authors:
Teo Gil Moreno de Mora Sardà
Abstract:
We prove that if a complete Riemannian $n$-manifold with non-trivial codimension 1 homology with $\mathbb{Z}_2$-coefficients or $\mathbb{Z}$-coefficients has sufficiently large positive macroscopic scalar curvature, then it contains a non-nullhomologous hypersurface of small Urysohn $(n-2)$-width. This constitutes a macroscopic analogue of a theorem by Bray--Brendle--Neves on the area of non-contractible 2-spheres in a closed Riemannian 3-manifold with positive scalar curvature. Our proof is based on an adaptation of Guth's macroscopic version of the Schoen--Yau descent argument.
Submitted 9 April, 2025;
originally announced April 2025.
-
MATCH: Model-Aware TVM-based Compilation for Heterogeneous Edge Devices
Authors:
Mohamed Amine Hamdi,
Francesco Daghero,
Giuseppe Maria Sarda,
Josse Van Delm,
Arne Symons,
Luca Benini,
Marian Verhelst,
Daniele Jahier Pagliari,
Alessio Burrello
Abstract:
Streamlining the deployment of Deep Neural Networks (DNNs) on heterogeneous edge platforms, coupling within the same micro-controller unit (MCU) instruction processors and hardware accelerators for tensor computations, is becoming one of the crucial challenges of the TinyML field.
The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting to a different heterogeneous MCU family implies labor-intensive re-development of almost the entire compiler. At the other extreme, retargetable toolchains, such as TVM, fail to exploit the capabilities of custom accelerators, resulting in the generation of general but unoptimized code. To overcome this duality, we introduce MATCH, a novel TVM-based DNN deployment framework designed for easy, agile retargeting across different MCU processors and accelerators, thanks to a customizable model-based hardware abstraction.
We show that a general and retargetable mapping framework enhanced with hardware cost models can compete with and even outperform custom toolchains on diverse targets while only needing the definition of an abstract hardware model and a SoC-specific API.
We tested MATCH on two state-of-the-art heterogeneous MCUs, GAP9 and DIANA.
On the four DNN models of the MLPerf Tiny suite, MATCH reduces inference latency by up to 60.88 times on DIANA compared to plain TVM, thanks to the exploitation of the on-board HW accelerator. Compared to HTVM, a fully customized toolchain for DIANA, we still reduce the latency by 16.94%. On GAP9, using the same benchmarks, we improve the latency by 2.15 times compared to the dedicated DORY compiler, thanks to our heterogeneous DNN mapping approach that synergistically exploits the DNN accelerator and the eight-core cluster available on board.
Submitted 11 October, 2024;
originally announced October 2024.
-
Optimising GPGPU Execution Through Runtime Micro-Architecture Parameter Analysis
Authors:
Giuseppe M. Sarda,
Nimish Shah,
Debjyoti Bhattacharjee,
Peter Debacker,
Marian Verhelst
Abstract:
GPGPU execution analysis has always been tied to closed-source, proprietary benchmarking tools that provide high-level, non-exhaustive, and/or statistical information, preventing a thorough understanding of bottlenecks and optimization possibilities. Open-source hardware platforms offer opportunities to overcome such limits and co-optimize the full {hardware-mapping-algorithm} compute stack. Yet, so far, this has remained under-explored. In this work, we exploit micro-architecture parameter analysis to develop a hardware-aware, runtime mapping technique for OpenCL kernels on the open Vortex RISC-V GPGPU. Our method is based on trace observations and targets optimal hardware resource utilization to achieve superior performance and flexibility compared to hardware-agnostic mapping approaches. The technique was validated on different architectural GPU configurations across several OpenCL kernels. Overall, our approach significantly enhances the performance of the open-source Vortex GPGPU, contributing to unlocking its potential and usability.
Submitted 14 June, 2024;
originally announced July 2024.
-
Complete 3-manifolds of positive scalar curvature with quadratic decay
Authors:
Florent Balacheff,
Teo Gil Moreno de Mora Sardà,
Stéphane Sabourau
Abstract:
We prove that if an orientable 3-manifold $M$ admits a complete Riemannian metric whose scalar curvature is positive and has a subquadratic decay at infinity, then it decomposes as a (possibly infinite) connected sum of spherical manifolds and $\mathbb{S}^2 \times \mathbb{S}^1$ summands. This generalises a theorem of Gromov and Wang by using a different, more topological, approach. As a result, the manifold $M$ carries a complete Riemannian metric of uniformly positive scalar curvature, which partially answers a conjecture of Gromov. More generally, the topological decomposition holds without any scalar curvature assumption under a weaker condition on the filling discs of closed curves in the universal cover based on the notion of fill radius. Moreover, the decay rate of the scalar curvature is optimal in this decomposition theorem. Indeed, the manifold $\mathbb{R}^2 \times \mathbb{S}^1$ supports a complete metric of positive scalar curvature with exactly quadratic decay, but does not admit a decomposition as a connected sum.
Submitted 11 May, 2025; v1 submitted 9 July, 2024;
originally announced July 2024.
-
HTVM: Efficient Neural Network Deployment On Heterogeneous TinyML Platforms
Authors:
Josse Van Delm,
Maarten Vandersteegen,
Alessio Burrello,
Giuseppe Maria Sarda,
Francesco Conti,
Daniele Jahier Pagliari,
Luca Benini,
Marian Verhelst
Abstract:
Optimal deployment of deep neural networks (DNNs) on state-of-the-art Systems-on-Chips (SoCs) is crucial for tiny machine learning (TinyML) at the edge. The complexity of these SoCs makes deployment non-trivial, as they typically contain multiple heterogeneous compute cores with limited, programmer-managed memory to optimize latency and energy efficiency. We propose HTVM, a compiler that merges TVM with DORY to maximize the utilization of heterogeneous accelerators and minimize data movements. HTVM allows deploying the MLPerf(TM) Tiny suite on DIANA, an SoC with a RISC-V CPU and digital and analog compute-in-memory AI accelerators, at 120x improved performance over plain TVM deployment.
Submitted 11 June, 2024;
originally announced June 2024.
-
Precision-aware Latency and Energy Balancing on Multi-Accelerator Platforms for DNN Inference
Authors:
Matteo Risso,
Alessio Burrello,
Giuseppe Maria Sarda,
Luca Benini,
Enrico Macii,
Massimo Poncino,
Marian Verhelst,
Daniele Jahier Pagliari
Abstract:
The need to execute Deep Neural Networks (DNNs) at low latency and low power at the edge has spurred the development of new heterogeneous Systems-on-Chips (SoCs) encapsulating a diverse set of hardware accelerators. How to optimally map a DNN onto such multi-accelerator systems is an open problem. We propose ODiMO, a hardware-aware tool that performs a fine-grain mapping across different accelerators on-chip, splitting individual layers and executing them in parallel, to reduce inference energy consumption or latency, while taking into account each accelerator's quantization precision to maintain accuracy. Pareto-optimal networks in the accuracy vs. energy or latency space are pursued for three popular dataset/DNN pairs, and deployed on the DIANA heterogeneous ultra-low power edge AI SoC. We show that ODiMO reduces energy/latency by up to 33%/31% with limited accuracy drop (-0.53%/-0.32%) compared to manual heuristic mappings.
Submitted 8 June, 2023;
originally announced June 2023.
-
CoNLoCNN: Exploiting Correlation and Non-Uniform Quantization for Energy-Efficient Low-precision Deep Convolutional Neural Networks
Authors:
Muhammad Abdullah Hanif,
Giuseppe Maria Sarda,
Alberto Marchisio,
Guido Masera,
Maurizio Martina,
Muhammad Shafique
Abstract:
In today's era of smart cyber-physical systems, Deep Neural Networks (DNNs) have become ubiquitous due to their state-of-the-art performance in complex real-world applications. The high computational complexity of these networks, which translates to increased energy consumption, is the foremost obstacle to deploying large DNNs in resource-constrained systems. Fixed-Point (FP) implementations achieved through post-training quantization are commonly used to curtail the energy consumption of these networks. However, the uniform quantization intervals in FP restrict the bit-width of data structures to large values due to the need to represent most of the numbers with sufficient resolution and avoid high quantization errors. In this paper, we leverage the key insight that (in most scenarios) DNN weights and activations are mostly concentrated near zero and only a few of them have large magnitudes. We propose CoNLoCNN, a framework to enable energy-efficient low-precision deep convolutional neural network inference by exploiting: (1) non-uniform quantization of weights enabling simplification of complex multiplication operations; and (2) correlation between activation values enabling partial compensation of quantization errors at low cost without any run-time overheads. To significantly benefit from non-uniform quantization, we also propose a novel data representation format, Encoded Low-Precision Binary Signed Digit, to compress the bit-width of weights while ensuring direct use of the encoded weight for processing using a novel multiply-and-accumulate (MAC) unit design.
Submitted 30 July, 2022;
originally announced August 2022.