-
Improving compiler support for SIMD offload using Arm Streaming SVE
Authors:
Mohamed Husain Noor Mohamed,
Adarsh Patil,
Latchesar Ionkov,
Eric Van Hensbergen
Abstract:
The wider adoption of tightly coupled core-adjacent accelerators, such as Arm Scalable Matrix Extension (SME), hinges on lowering software programming complexity. In this paper, we focus on enabling the use of SME architecture in Streaming Scalable Vector Extension (SSVE) mode for workloads written in C/C++. While current compilers optimize loops for all types of SIMD instructions, these techniques primarily target vector units within the core and falter when applied to disaggregated, core-adjacent SIMD accelerators. Our goal is to enable the compiler to automatically generate code for such accelerators only when profitable.
To this end, we investigate a path towards performant, precise, and repeatable computation offloading through two compiler ecosystems. We revisit LLVM compiler passes and MLIR transforms, along with their associated cost models and heuristics. We hope that these insights can provide directions for evolving compiler capabilities towards automatic code generation for this next-generation vector processing paradigm.
Submitted 2 June, 2025;
originally announced June 2025.
-
UDON: A case for offloading to general purpose compute on CXL memory
Authors:
Jon Hermes,
Josh Minor,
Minjun Wu,
Adarsh Patil,
Eric Van Hensbergen
Abstract:
Upcoming CXL-based disaggregated memory devices feature special purpose units to offload compute to near-memory. In this paper, we explore opportunities for offloading compute to general purpose cores on CXL memory devices, thereby enabling a greater utility and diversity of offload.
We study two classes of popular memory-intensive applications, ML inference and vector databases, as candidates for computational offload. The study uses Arm AArch64-based dual-socket NUMA systems to emulate CXL type-2 devices.
Our study shows promising results. With our ML inference model partitioning strategy for compute offload, we can place up to 90% of the data in remote memory with just a 20% performance trade-off. Offloading Hierarchical Navigable Small World (HNSW) kernels in vector databases can provide up to 6.87$\times$ performance improvement with under 10% offload overhead.
Submitted 3 April, 2024;
originally announced April 2024.
-
Private delegated computations using strong isolation
Authors:
Mathias Brossard,
Guilhem Bryant,
Basma El Gaabouri,
Xinxin Fan,
Alexandre Ferreira,
Edmund Grimley-Evans,
Christopher Haster,
Evan Johnson,
Derek Miller,
Fan Mo,
Dominic P. Mulligan,
Nick Spinale,
Eric van Hensbergen,
Hugo J. M. Vincent,
Shale Xiong
Abstract:
Sensitive computations are now routinely delegated to third parties. In response, Confidential Computing technologies are being introduced to microprocessors, offering a protected processing environment, which we generically call an isolate, providing confidentiality and integrity guarantees to code and data hosted within -- even in the face of a privileged attacker. Isolates, with an attestation protocol, permit remote third parties to establish a trusted "beachhead" containing known code and data on an otherwise untrusted machine. Yet, the rise of these technologies introduces many new problems, including: how to ease provisioning of computations safely into isolates; how to develop distributed systems spanning multiple classes of isolate; and what to do about the billions of "legacy" devices without support for Confidential Computing?
Tackling the problems above, we introduce Veracruz, a framework that eases the design and implementation of complex privacy-preserving, collaborative, delegated computations among a group of mutually mistrusting principals. Veracruz supports multiple isolation technologies and provides a common programming model and attestation protocol across all of them, smoothing deployment of delegated computations over supported technologies. We demonstrate Veracruz in operation, on private in-cloud object detection on encrypted video streaming from a video camera. In addition to supporting hardware-backed isolates -- like AWS Nitro Enclaves and Arm Confidential Computing Architecture Realms -- Veracruz also provides pragmatic "software isolates" on Armv8-A devices without hardware Confidential Computing capability, using the high-assurance seL4 microkernel and our IceCap framework.
Submitted 6 May, 2022;
originally announced May 2022.