-
Self-Regulating Random Walks for Resilient Decentralized Learning on Graphs
Authors:
Maximilian Egger,
Rawad Bitar,
Ghadir Ayache,
Antonia Wachter-Zeh,
Salim El Rouayheb
Abstract:
Consider the setting of multiple random walks (RWs) on a graph executing a certain computational task. For instance, in decentralized learning via RWs, a model is updated at each iteration based on the local data of the visited node and then passed to a randomly chosen neighbor. RWs can fail due to node or link failures. The goal is to maintain a desired number of RWs to ensure failure resilience. Achieving this is challenging because no central entity tracks which RWs have failed and replaces them with new ones by forking (duplicating) surviving ones. Without duplications, the number of RWs will eventually go to zero, causing a catastrophic failure of the system. We propose two decentralized algorithms called DecAFork and DecAFork+ that can maintain the number of RWs in the graph around a desired value even in the presence of arbitrary RW failures. Nodes continuously estimate the number of surviving RWs by estimating their return time distribution and fork the RWs when failures are likely to have happened. DecAFork+ additionally allows terminations to avoid overloading the network by forking too many RWs. We present extensive numerical simulations that show the performance of DecAFork and DecAFork+ in terms of fast detection of and reaction to failures compared to a baseline, and establish theoretical guarantees on the performance of both algorithms.
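A minimal sketch of the setting described in this abstract, not of DecAFork or DecAFork+ themselves: random walks move on a graph, each walk fails independently at every step, and a node forks the visiting walk when it has not been visited for a long time. The function names, the ring graph, the per-step failure model, and the fixed `gap_threshold` rule are all illustrative assumptions; the paper's algorithms estimate the return time distribution rather than using a fixed threshold.

```python
import random


def ring_graph(n):
    """Hypothetical helper: adjacency lists of a ring with n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def simulate_walks(adjacency, num_walks, steps, fail_prob,
                   gap_threshold=None, seed=0):
    """Return the number of surviving walks after each step.

    Assumptions (not from the paper): every walk fails independently with
    probability fail_prob per step, and a node forks the arriving walk if
    the time since its last visit exceeds gap_threshold. With
    gap_threshold=None no forking happens.
    """
    rng = random.Random(seed)
    nodes = list(adjacency)
    walks = [rng.choice(nodes) for _ in range(num_walks)]
    last_visit = {v: 0 for v in nodes}
    history = []
    for t in range(1, steps + 1):
        survivors = []
        for position in walks:
            if rng.random() < fail_prob:
                continue  # this walk is lost at this step
            nxt = rng.choice(adjacency[position])  # pass to a random neighbor
            # Crude local rule: a long gap since the last visit suggests
            # that walks have failed, so the node duplicates this one.
            if gap_threshold is not None and t - last_visit[nxt] > gap_threshold:
                survivors.append(nxt)
            last_visit[nxt] = t
            survivors.append(nxt)
        walks = survivors
        history.append(len(walks))
    return history


if __name__ == "__main__":
    graph = ring_graph(20)
    print(simulate_walks(graph, 5, 200, 0.01)[-1])                   # no forking
    print(simulate_walks(graph, 5, 200, 0.01, gap_threshold=15)[-1])  # with forking
```

Because this sketch only forks and never terminates walks, the population can grow without bound when failures are rare, which is the overload that the abstract says DecAFork+ addresses with terminations.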
Submitted 10 February, 2025; v1 submitted 16 July, 2024;
originally announced July 2024.
-
Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers
Authors:
Serge Kas Hanna,
Rawad Bitar,
Parimal Parag,
Venkat Dasari,
Salim El Rouayheb
Abstract:
We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers each having a subset of the data. Distributed SGD may suffer from the effect of stragglers, i.e., slow or unresponsive workers who cause delays. One solution studied in the literature is to wait at each iteration for the responses of the fastest $k<n$ workers before updating the model, where $k$ is a fixed parameter. The choice of the value of $k$ presents a trade-off between the runtime (i.e., convergence rate) of SGD and the error of the model. Towards optimizing the error-runtime trade-off, we investigate distributed SGD with adaptive $k$. We first design an adaptive policy for varying $k$ that optimizes this trade-off based on an upper bound on the error as a function of the wall-clock time which we derive. Then, we propose an algorithm for adaptive distributed SGD that is based on a statistical heuristic. We implement our algorithm and provide numerical simulations which confirm our intuition and theoretical analysis.
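A minimal sketch of the fastest-$k$ scheme this abstract describes, under stated assumptions: a scalar least-squares model, exponentially distributed worker response times, and a caller-supplied `k_schedule` function that returns $k$ for each iteration. All names here are hypothetical, and the example schedule is only an illustration of varying $k$, not the adaptive policy derived in the paper.

```python
import random


def fastest_k_sgd(shards, k_schedule, iterations, lr=0.1, seed=0):
    """Simulate distributed SGD that waits for the fastest k workers.

    shards: one list of (x, y) samples per worker.
    k_schedule: function mapping the iteration index to k (1 <= k <= n).
    Returns the final scalar model w and the simulated wall-clock time.
    Assumption: response times are i.i.d. exponential with rate 1.
    """
    rng = random.Random(seed)
    n = len(shards)
    w, clock = 0.0, 0.0
    for it in range(iterations):
        k = k_schedule(it)
        # Draw a response time for every worker and keep the k fastest.
        times = sorted((rng.expovariate(1.0), i) for i in range(n))
        fastest = times[:k]
        # The master waits until the k-th fastest worker responds.
        clock += fastest[-1][0]
        grads = []
        for _, i in fastest:
            x, y = rng.choice(shards[i])          # one local sample
            grads.append(2.0 * (w * x - y) * x)   # gradient of (w*x - y)^2
        w -= lr * sum(grads) / k
    return w, clock


if __name__ == "__main__":
    # Noiseless data with y = 3x, split across 4 workers.
    shards = [[(x, 3.0 * x) for x in (0.5, 1.0, 1.5)] for _ in range(4)]
    print(fastest_k_sgd(shards, lambda it: 2, 200))             # fixed k
    print(fastest_k_sgd(shards, lambda it: 1 + it // 67, 200))  # k grows over time
```

A small $k$ makes each iteration fast but averages fewer gradients, while a large $k$ waits longer for a less noisy update, which is the error-runtime trade-off the abstract refers to.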
Submitted 25 February, 2020;
originally announced February 2020.
-
Advances and Open Problems in Federated Learning
Authors:
Peter Kairouz,
H. Brendan McMahan,
Brendan Avent,
Aurélien Bellet,
Mehdi Bennis,
Arjun Nitin Bhagoji,
Kallista Bonawitz,
Zachary Charles,
Graham Cormode,
Rachel Cummings,
Rafael G. L. D'Oliveira,
Hubert Eichner,
Salim El Rouayheb,
David Evans,
Josh Gardner,
Zachary Garrett,
Adrià Gascón,
Badih Ghazi,
Phillip B. Gibbons,
Marco Gruteser,
Zaid Harchaoui,
Chaoyang He,
Lie He,
Zhouyuan Huo,
Ben Hutchinson
, et al. (34 additional authors not shown)
Abstract:
Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
Submitted 8 March, 2021; v1 submitted 10 December, 2019;
originally announced December 2019.