-
Scheduling Strategies for Partially-Replicable Task Chains on Two Types of Resources
Authors:
Diane Orhan,
Yacine Idouar,
Laércio Lima Pilla,
Adrien Cassagne,
Denis Barthou,
Christophe Jego
Abstract:
The arrival of heterogeneous (or hybrid) multicore architectures on parallel platforms has brought new performance opportunities for applications and efficiency opportunities to systems. They have also increased the challenges related to thread scheduling, as tasks' execution times vary depending on whether they are placed on big (performance) cores or little (efficient) ones. In this paper, we focus on the challenges heterogeneous multicore problems bring to partially-replicable task chains, such as the ones that implement digital communication standards in Software-Defined Radio (SDR). Our objective is to maximize the throughput of these task chains while also minimizing their power consumption. We model this problem as a pipelined workflow scheduling problem using pipelined and replicated parallelism on two types of resources, whose objectives are to minimize the period and to use as many little cores as necessary. We propose two greedy heuristics (FERTAC and 2CATAC) and one optimal dynamic programming (HeRAD) solution to the problem. We evaluate our solutions and compare the quality of their schedules (in period and resource utilization) and their execution times using synthetic task chains and an implementation of the DVB-S2 communication standard running on StreamPU. Our results demonstrate the benefits and drawbacks of the different proposed solutions. On average, FERTAC and 2CATAC achieve near-optimal solutions, with periods that are less than 10% worse than the optimal (HeRAD) using fewer than 2 extra cores. These three scheduling strategies now enable programmers and users of StreamPU to transparently make use of heterogeneous multicore processors and achieve throughputs that differ from their theoretical maximums by less than 8% on average.
Submitted 14 February, 2025;
originally announced February 2025.
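To make the period objective in the abstract above concrete, here is a minimal sketch of how the period of a pipelined schedule with replication could be evaluated. It is not FERTAC, 2CATAC, or HeRAD. The `Task` and `Stage` types, their fields, and the rule that a stage containing a non-replicable task gains nothing from extra cores are illustrative assumptions, not definitions taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    w_big: float       # assumed execution time on a big core
    w_little: float    # assumed execution time on a little core
    replicable: bool   # True if the task may be replicated across cores


@dataclass
class Stage:
    first: int         # index of the first task in the stage
    last: int          # index of the last task in the stage (inclusive)
    cores: int         # number of cores given to the stage
    core_type: str     # "big" or "little"


def stage_period(tasks: List[Task], stage: Stage) -> float:
    """Period of one stage under the assumed model."""
    chunk = tasks[stage.first:stage.last + 1]
    total = sum(t.w_big if stage.core_type == "big" else t.w_little
                for t in chunk)
    # Assumption: replication only helps when every task in the stage
    # is replicable; otherwise the stage behaves as if it had one core.
    if all(t.replicable for t in chunk):
        return total / stage.cores
    return total


def schedule_period(tasks: List[Task], stages: List[Stage]) -> float:
    """The pipeline period is set by its slowest stage."""
    return max(stage_period(tasks, s) for s in stages)


chain = [Task(2, 4, False), Task(6, 12, True), Task(1, 2, False)]
plan = [Stage(0, 0, 1, "big"), Stage(1, 1, 3, "little"), Stage(2, 2, 1, "little")]
period = schedule_period(chain, plan)
```

In this toy chain the replicable middle task runs on three little cores, so the big core is only needed for the first stage.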
-
Scheduling Algorithms for Federated Learning with Minimal Energy Consumption
Authors:
Laércio Lima Pilla
Abstract:
Federated Learning (FL) has opened the opportunity for collaboratively training machine learning models on heterogeneous mobile or Edge devices while keeping local data private. With an increase in its adoption, a growing concern is related to its economic and environmental cost (as is also the case for other machine learning techniques). Unfortunately, little work has been done to optimize its energy consumption or emissions of carbon dioxide or equivalents, as energy minimization is usually left as a secondary objective. In this paper, we investigate the problem of minimizing the energy consumption of FL training on heterogeneous devices by controlling the workload distribution. We model this as the Minimal Cost FL Schedule problem, a total cost minimization problem with identical, independent, and atomic tasks that have to be assigned to heterogeneous resources with arbitrary cost functions. We propose a pseudo-polynomial optimal solution to the problem based on the previously unexplored Multiple-Choice Minimum-Cost Maximal Knapsack Packing Problem. We also provide four algorithms for scenarios where cost functions are monotonically increasing and follow the same behavior. These solutions are likewise applicable to the minimization of other kinds of costs, and to other one-dimensional data partition problems.
Submitted 13 September, 2022;
originally announced September 2022.
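The total cost minimization described in the abstract above can be illustrated with a generic pseudo-polynomial dynamic program. This is a textbook-style sketch, not the paper's knapsack-based algorithm. The function name `min_cost_schedule` and the tabulated cost format (`costs[i][k]` is the cost of giving `k` tasks to resource `i`) are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def min_cost_schedule(costs: List[List[float]],
                      total_tasks: int) -> Tuple[float, List[int]]:
    """Split total_tasks identical tasks among resources at minimal total cost.

    costs[i][k] is the (arbitrary) cost of assigning k tasks to resource i,
    for k = 0 .. len(costs[i]) - 1, so each table also encodes an upper limit.
    """
    # best[t] = minimal cost of placing t tasks on the resources seen so far
    best = [0.0] + [math.inf] * total_tasks
    choices = []
    for cost_i in costs:
        new_best = [math.inf] * (total_tasks + 1)
        choice = [0] * (total_tasks + 1)
        for t in range(total_tasks + 1):
            for k in range(min(t, len(cost_i) - 1) + 1):
                candidate = best[t - k] + cost_i[k]
                if candidate < new_best[t]:
                    new_best[t] = candidate
                    choice[t] = k
        best = new_best
        choices.append(choice)
    if math.isinf(best[total_tasks]):
        raise ValueError("the resources cannot hold that many tasks")
    # Walk the recorded choices backwards to recover the assignment.
    assignment = []
    t = total_tasks
    for choice in reversed(choices):
        assignment.append(choice[t])
        t -= choice[t]
    assignment.reverse()
    return best[total_tasks], assignment


cost, split = min_cost_schedule([[0, 5, 6, 20], [0, 2, 9, 10]], 3)
```

The running time grows with the number of tasks rather than with the input size, which is what makes an approach of this kind pseudo-polynomial.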
-
Optimal Task Assignment to Heterogeneous Federated Learning Devices
Authors:
Laércio Lima Pilla
Abstract:
Federated Learning provides new opportunities for training machine learning models while respecting data privacy. This technique is based on heterogeneous devices that work together to iteratively train a model while never sharing their own data. Given the synchronous nature of this training, the performance of Federated Learning systems is dictated by the slowest devices, also known as stragglers. In this paper, we investigate the problem of minimizing the duration of Federated Learning rounds by controlling how much data each device uses for training. We formulate this problem as a makespan minimization problem with identical, independent, and atomic tasks that have to be assigned to heterogeneous resources with non-decreasing cost functions while respecting lower and upper limits of tasks per resource. Based on this formulation, we propose a polynomial-time algorithm named OLAR and prove that it provides optimal schedules. We evaluate OLAR in an extensive simulation-based experimental study that includes comparisons to other state-of-the-art algorithms and new extensions to them. Our results indicate that OLAR provides optimal solutions with a small execution time. They also show that the presence of lower and upper limits of tasks per resource erases any benefits that suboptimal heuristics could provide in terms of algorithm execution time.
Submitted 1 October, 2020;
originally announced October 2020.
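The makespan formulation in the abstract above lends itself to a simple greedy illustration: satisfy the lower limits first, then hand out each remaining task to the resource whose cost after receiving it would be smallest. The sketch below follows that idea with a heap. It is offered as an assumption-laden illustration and should not be read as the published OLAR algorithm; the name `greedy_assign` and the callable cost functions are hypothetical.

```python
import heapq
from typing import Callable, List, Tuple


def greedy_assign(cost_fns: List[Callable[[int], float]],
                  lower: List[int],
                  upper: List[int],
                  total_tasks: int) -> Tuple[List[int], float]:
    """Greedy task split for non-decreasing cost functions with limits."""
    counts = list(lower)                      # start from the lower limits
    remaining = total_tasks - sum(counts)
    if remaining < 0 or total_tasks > sum(upper):
        raise ValueError("limits are incompatible with total_tasks")
    # Heap keyed by the cost each resource would reach with one more task.
    heap = [(cost_fns[i](counts[i] + 1), i)
            for i in range(len(cost_fns)) if counts[i] < upper[i]]
    heapq.heapify(heap)
    for _ in range(remaining):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        if counts[i] < upper[i]:              # respect the upper limit
            heapq.heappush(heap, (cost_fns[i](counts[i] + 1), i))
    makespan = max(f(c) for f, c in zip(cost_fns, counts))
    return counts, makespan


devices = [lambda k: 2 * k, lambda k: 3 * k, lambda k: 10 * k]
counts, makespan = greedy_assign(devices, [0, 0, 1], [5, 5, 5], 6)
```

Here the slow third device is forced to take one task by its lower limit, and that single task ends up defining the round duration.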
-
Mapping Matters: Application Process Mapping on 3-D Processor Topologies
Authors:
Jonas H. Müller Korndörfer,
Mario Bielert,
Laércio L. Pilla,
Florina M. Ciorba
Abstract:
Applications' performance is influenced by the mapping of processes to computing nodes, the frequency and volume of exchanges among processing elements, the network capacity, and the routing protocol. A poor mapping of application processes degrades performance and wastes resources. Process mapping is frequently ignored as an explicit optimization step since the system typically offers a default mapping, users may lack awareness of their applications' communication behavior, and the opportunities for improving performance through mapping are often unclear. This work studies the impact of application process mapping on several processor topologies. We propose a workflow that renders mapping as an explicit optimization step for parallel applications. We apply the workflow to a set of four applications, twelve mapping algorithms, and three direct network topologies. We assess the mappings' quality in terms of volume, frequency, and distance of exchanges using metrics such as dilation (measured in hop$\cdot$Byte). With a parallel trace-based simulator, we predict the applications' execution on the three topologies using the twelve mappings. We evaluate the impact of process mapping on the applications' simulated performance in terms of execution and communication times and identify the mappings that achieve the highest performance in both cases. To ensure the correctness of the simulations, we compare the pre- and post-simulation results. This work emphasizes the importance of process mapping as an explicit optimization step and offers a solution for parallel applications to exploit the full potential of the allocated resources on a given system.
Submitted 10 March, 2021; v1 submitted 20 May, 2020;
originally announced May 2020.
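The hop·Byte measure mentioned in the abstract above can be sketched as the sum, over communicating process pairs, of the exchanged volume multiplied by the hop distance between the nodes they are mapped to. The example below assumes a 3-D mesh with Manhattan-distance routing and a dictionary-based traffic matrix; the function names and data layout are illustrative choices, not the paper's tooling.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int, int]


def hops_3d_mesh(a: Coord, b: Coord) -> int:
    """Hop count between two nodes of a 3-D mesh (no wrap-around links assumed)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hop_bytes(traffic: Dict[Tuple[int, int], int],
              mapping: Dict[int, Coord]) -> int:
    """Total hop*Byte cost of a mapping.

    traffic[(src, dst)] is the number of bytes sent from process src to dst;
    mapping[p] is the mesh coordinate hosting process p.
    """
    return sum(volume * hops_3d_mesh(mapping[src], mapping[dst])
               for (src, dst), volume in traffic.items())


traffic = {(0, 1): 100, (1, 2): 50}
compact = {0: (0, 0, 0), 1: (1, 0, 0), 2: (1, 1, 0)}
scattered = {0: (0, 0, 0), 1: (2, 2, 0), 2: (0, 0, 1)}
compact_cost = hop_bytes(traffic, compact)
scattered_cost = hop_bytes(traffic, scattered)
```

Comparing the two costs shows how placing communicating processes on neighboring nodes lowers the metric for the same traffic pattern.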