-
Balanced segmentation of CNNs for multi-TPU inference
Authors:
Jorge Villarrubia,
Luis Costero,
Francisco D. Igual,
Katzalin Olcoz
Abstract:
In this paper, we propose different alternatives for segmenting convolutional neural networks (CNNs), addressing inference processes on computing architectures composed of multiple Edge TPUs. Specifically, we compare the inference performance for a number of state-of-the-art CNN models, taking as a reference inference times on one TPU and a compiler-based pipelined inference implementation as provided by Google's Edge TPU compiler. Departing from a profile-based segmentation strategy, we provide further refinements to balance the workload across multiple TPUs, leveraging their cooperative computing power, reducing work imbalance and alleviating the memory access bottleneck due to the limited amount of on-chip memory per TPU. The observed performance results compared with a single TPU yield superlinear speedups and accelerations up to 2.60x compared with the segmentation offered by the compiler targeting multiple TPUs.
Submitted 2 March, 2025;
originally announced March 2025.
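The workload-balancing idea in the abstract above can be illustrated with a minimal sketch. It assumes that each layer has a single profiled cost (for example, milliseconds on one TPU) and that segments must be contiguous runs of layers; the function name `balanced_segments` and the sample costs are hypothetical, and the dynamic program below is a textbook linear-partition method, not necessarily the strategy used in the paper.

```python
from itertools import accumulate


def balanced_segments(layer_costs, num_devices):
    """Split layer_costs into contiguous segments, one per device,
    so that the most expensive segment is as cheap as possible.

    Hypothetical helper: costs are assumed to be profiled per-layer times.
    Returns (bottleneck_cost, list_of_segments).
    """
    n = len(layer_costs)
    k = min(num_devices, n)  # never create empty segments
    prefix = [0] + list(accumulate(layer_costs))
    inf = float("inf")

    # best[d][i]: smallest bottleneck when the first i layers use d segments
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0

    for d in range(1, k + 1):
        for i in range(d, n + 1):
            for j in range(d - 1, i):
                # last segment covers layers j..i-1
                candidate = max(best[d - 1][j], prefix[i] - prefix[j])
                if candidate < best[d][i]:
                    best[d][i] = candidate
                    cut[d][i] = j

    # Walk the recorded cut points backwards to recover the segments.
    bounds = []
    i = n
    for d in range(k, 0, -1):
        j = cut[d][i]
        bounds.append((j, i))
        i = j
    bounds.reverse()
    return best[k][n], [layer_costs[a:b] for a, b in bounds]


if __name__ == "__main__":
    costs = [4, 2, 3, 7, 1, 5]  # made-up per-layer times
    bottleneck, segments = balanced_segments(costs, 3)
    print(bottleneck, segments)
```

A per-layer time alone does not capture the on-chip memory limit that the abstract mentions, so a realistic cost model would also need to account for where each segment's parameters reside.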
-
Improving inference time in multi-TPU systems with profiled model segmentation
Authors:
Jorge Villarrubia,
Luis Costero,
Francisco D. Igual,
Katzalin Olcoz
Abstract:
In this paper, we systematically evaluate the inference performance of the Edge TPU by Google for neural networks with different characteristics. Specifically, we determine that, given the limited amount of on-chip memory on the Edge TPU, accesses to external (host) memory rapidly become an important performance bottleneck. We demonstrate how multiple devices can be jointly used to alleviate the bottleneck introduced by accessing the host memory. We propose a solution combining model segmentation and pipelining on up to four TPUs, with remarkable performance improvements that range from $6\times$ for neural networks with convolutional layers to $46\times$ for fully connected layers, compared with single-TPU setups.
Submitted 2 March, 2025;
originally announced March 2025.
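As a rough illustration of why pipelining segments across devices raises throughput, the following toy model assumes a linear pipeline with fixed per-stage times and no transfer overhead. The function name `pipeline_estimates` and the stage times are hypothetical; the model deliberately ignores host-memory effects, so it cannot by itself reproduce the speedups reported in the abstract above.

```python
def pipeline_estimates(stage_times_ms, num_inputs):
    """Estimate timing for identical inputs flowing through a linear pipeline.

    Assumption: each stage handles one input at a time and hands it to the
    next stage with no communication cost.
    """
    latency = sum(stage_times_ms)     # time for one input to cross all stages
    bottleneck = max(stage_times_ms)  # slowest stage paces the steady state
    # The first input needs the full latency; every later input completes
    # one bottleneck interval after the previous one.
    total = latency + (num_inputs - 1) * bottleneck
    return {
        "latency_ms": latency,
        "bottleneck_ms": bottleneck,
        "total_ms": total,
        "throughput_per_s": 1000.0 * num_inputs / total,
    }


if __name__ == "__main__":
    print(pipeline_estimates([9, 7, 6], 100))
```

In this model the steady-state rate depends only on the slowest stage, which is why balancing the segments matters more than minimizing any single one.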
-
Teaching Experiences using the RVfpga Package
Authors:
D. Chaver,
S. Harris,
L. Pinuel,
O. Kindgren,
R. Kravitz,
J. I. Gomez,
F. Castro,
K. Olcoz,
J. Villalba,
A. Grinshpun,
F. Gabbay,
L. Seed,
R. Duarte,
M. Lopez,
O. Alonso,
R. Owen
Abstract:
The RVfpga course offers a solid introduction to computer architecture using the RISC-V instruction set and FPGA technology. It focuses on providing hands-on experience with real-world RISC-V cores, the VeeR EH1 and the VeeR EL2, developed by Western Digital a few years ago and currently hosted by ChipsAlliance. This course is particularly aimed at educators and students in computer science, computer engineering, and related fields, enabling them to integrate practical RISC-V knowledge into their curricula. The course materials, which include detailed labs and setup guides, are available for free through the Imagination University Programme website. We have used RVfpga in different teaching activities and we plan to continue using it in the future. Specifically, we have used RVfpga as the main experimental platform in several bachelor/master degree courses; we have completed several final bachelor/master degree projects based on this platform; we will conduct a microcredential about processor design based on RVfpga; we have adapted RVfpga to a MOOC in the edX platform; and we have shared RVfpga worldwide through one-day hands-on workshops and tutorials. This paper begins by discussing how the RVfpga course matches the latest IEEE/ACM/AAAI computing curriculum guidelines. It then details various teaching implementations we have conducted over recent years using these materials. Finally, the paper examines other courses similar to RVfpga, comparing their strengths and weaknesses.
Submitted 22 November, 2024;
originally announced November 2024.
-
Advanced simulation-based predictive modelling for solar irradiance sensor farms
Authors:
José L. Risco-Martín,
Ignacio-Iker Prado-Rujas,
Javier Campoy,
María S. Pérez,
Katzalin Olcoz
Abstract:
As solar power continues to grow and replace traditional energy sources, the need for reliable forecasting models becomes increasingly important to ensure the stability and efficiency of the grid. However, the management of these models still needs to be improved, and new tools and technologies are required to handle the deployment and control of solar facilities. This work introduces a novel framework named Cloud-based Analysis and Integration for Data Efficiency (CAIDE), designed for real-time monitoring, management, and forecasting of solar irradiance sensor farms. CAIDE is designed to manage multiple sensor farms simultaneously while improving predictive models in real-time using well-grounded Modeling and Simulation (M&S) methodologies. The framework leverages Model Based Systems Engineering (MBSE) and an Internet of Things (IoT) infrastructure to support the deployment and analysis of solar plants in dynamic environments. The system can adapt and re-train the model when given incorrect results, ensuring that forecasts remain accurate and up-to-date. Furthermore, CAIDE can be executed in sequential, parallel, and distributed architectures, assuring scalability. The effectiveness of CAIDE is demonstrated in a complex scenario composed of several solar irradiance sensor farms connected to a centralized management system. Our results show that CAIDE is scalable and effective in managing and forecasting solar power production while improving the accuracy of predictive models in real time. The framework has important implications for the deployment of solar plants and the future of renewable energy sources.
Submitted 5 April, 2024;
originally announced April 2024.
-
Energy efficiency optimization of task-parallel codes on asymmetric architectures
Authors:
Luis Costero,
Francisco D. Igual,
Katzalin Olcoz,
Francisco Tirado
Abstract:
We present a family of policies that, integrated within a runtime task scheduler (Nanox), pursue the goal of improving the energy efficiency of task-parallel executions with no intervention from the programmer. The proposed policies tackle the problem by modifying the core operating frequency via DVFS mechanisms, or by enabling/disabling the mapping of tasks to specific cores at selected execution points, depending on the internal status of the scheduler. Experimental results on an asymmetric SoC (Exynos 5422) and for a specific operation (Cholesky factorization) reveal gains up to 29% in terms of energy efficiency and considerable reductions in average power.
Submitted 9 February, 2024;
originally announced February 2024.
-
Leveraging knowledge-as-a-service (KaaS) for QoS-aware resource management in multi-user video transcoding
Authors:
Luis Costero,
Francisco D. Igual,
Katzalin Olcoz,
Francisco Tirado
Abstract:
The coexistence of parallel applications in shared computing nodes, each one featuring different Quality of Service (QoS) requirements, poses new challenges to improve resource occupation while keeping acceptable rates in terms of QoS. As more application-specific and system-wide metrics are included as QoS dimensions, or under situations in which resource-usage limits are strict, building and serving the most appropriate set of actions (application control knobs and system resource assignment) to concurrent applications in an automatic and optimal fashion becomes mandatory. In this paper, we propose strategies to build and serve this type of knowledge to concurrent applications by leveraging Reinforcement Learning techniques. Taking multi-user video transcoding as a driving example, our experimental results reveal an excellent adaptation of resource and knob management to heterogeneous QoS requests, and increases of up to 1.24x in the number of concurrently served users compared with alternative approaches considering homogeneous QoS requests.
Submitted 7 February, 2024;
originally announced February 2024.
-
Optimization of a Line Detection Algorithm for Autonomous Vehicles on a RISC-V with Accelerator
Authors:
María José Belda,
Katzalin Olcoz,
Fernando Castro,
Francisco Tirado
Abstract:
In recent years, autonomous vehicles have attracted the attention of many research groups, both in academia and business, including researchers from leading companies such as Google, Uber and Tesla. Vehicles of this type are equipped with systems that are subject to very strict requirements, essentially aimed at performing safe operations -- both for potential passengers and pedestrians -- as well as carrying out the processing needed for decision making in real time. In many instances, general-purpose processors alone cannot ensure that these safety, reliability and real-time requirements are met, so it is common to implement heterogeneous systems by including accelerators. This paper explores the acceleration of a line detection application in the autonomous car environment using a heterogeneous system consisting of a general-purpose RISC-V core and a domain-specific accelerator. In particular, the application is analyzed to identify the most computationally intensive parts of the code and it is adapted accordingly for more efficient processing. Furthermore, the code is executed on the aforementioned hardware platform to verify that the execution effectively meets the existing requirements in autonomous vehicles, achieving a 3.7x speedup with respect to running without the accelerator.
Submitted 1 February, 2024;
originally announced February 2024.
-
A Unified Cloud-Enabled Discrete Event Parallel and Distributed Simulation Architecture
Authors:
José L. Risco-Martín,
Kevin Henares,
Saurabh Mittal,
Luis F. Almendras,
Katzalin Olcoz
Abstract:
Cloud simulation environments today are largely employed to model and simulate complex systems for remote accessibility and variable capacity requirements. In this regard, scalability issues in Modeling and Simulation (M&S) computational requirements can be tackled through the elasticity of on-demand Cloud deployment. However, implementing a high performance cloud M&S framework following these elastic principles is not a trivial task as parallelizing and distributing existing architectures is challenging. Indeed, parallel and distributed M&S developments have evolved along separate paths. Parallel solutions have always focused on ad-hoc solutions, while distributed approaches, on the other hand, have led to the definition of standard distributed frameworks like the High Level Architecture (HLA) or influenced the use of distributed technologies like the Message Passing Interface (MPI). Only a few developments have been able to evolve with the current resilience of computing hardware resources deployment, largely focused on the implementation of Simulation as a Service (SaaS), albeit independently of the parallel ad-hoc methods branch. In this paper, we present a unified parallel and distributed M&S architecture with enough flexibility to deploy parallel and distributed simulations in the Cloud with little effort, without modifying the underlying model source code, and reaching important speedups against the sequential simulation, especially in the parallel implementation. Our framework is based on the Discrete Event System Specification (DEVS) formalism. The performance of the parallel and distributed framework is tested using the xDEVS M&S tool, Application Programming Interface (API) and the DEVStone benchmark with up to eight computing nodes, obtaining maximum speedups of $15.95\times$ and $1.84\times$, respectively.
Submitted 22 February, 2023;
originally announced February 2023.
-
Detecting time-fragmented cache attacks against AES using Performance Monitoring Counters
Authors:
Iván Prada,
Francisco D. Igual,
Katzalin Olcoz
Abstract:
Cache timing attacks use shared caches in multi-core processors as side channels to extract information from victim processes. These attacks are particularly dangerous in cloud infrastructures, in which the deployed countermeasures cause collateral effects in terms of performance loss and increase in energy consumption. We propose to monitor the victim process using an independent monitoring (detector) process that continuously measures selected Performance Monitoring Counters (PMCs) to detect the presence of an attack. Ad-hoc countermeasures can then be applied only when such a risky situation arises. In our case, the victim process is the AES encryption algorithm and the attack is performed by means of random encryption requests. We demonstrate that PMCs are a feasible tool to detect the attack and that sampling PMCs at high frequencies is worse than sampling at lower frequencies in terms of detection capabilities, particularly when the attack is fragmented in time in an attempt to evade detection.
Submitted 25 April, 2019;
originally announced April 2019.
-
Revisiting Conventional Task Schedulers to Exploit Asymmetry in ARM big.LITTLE Architectures for Dense Linear Algebra
Authors:
Luis Costero,
Francisco D. Igual,
Katzalin Olcoz,
Enrique S. Quintana-Ortí
Abstract:
Dealing with asymmetry in the architecture opens a plethora of questions from the perspective of scheduling task-parallel applications, and there exist early attempts to address this problem via ad-hoc strategies embedded into a runtime framework. In this paper we take a different path, which consists in addressing the complexity of the problem at the library level, via a few asymmetry-aware fundamental kernels, hiding the architecture heterogeneity from the task scheduler. For the specific domain of dense linear algebra, we show that this is not only possible but delivers much higher performance than a naive approach based on an asymmetry-oblivious scheduler. Furthermore, this solution also outperforms an ad-hoc asymmetry-aware scheduler furnished with sophisticated scheduling techniques.
Submitted 7 September, 2015;
originally announced September 2015.