-
SWIFT: Rapid Decentralized Federated Learning via Wait-Free Model Communication
Authors:
Marco Bornstein,
Tahseen Rabbani,
Evan Wang,
Amrit Singh Bedi,
Furong Huang
Abstract:
The decentralized Federated Learning (FL) setting avoids the role of a potentially unreliable or untrustworthy central host by utilizing groups of clients to collaboratively train a model via localized training and model/gradient sharing. Most existing decentralized FL algorithms require synchronization of client models, where the speed of synchronization depends upon the slowest client. In this work, we propose SWIFT: a novel wait-free decentralized FL algorithm that allows clients to conduct training at their own speed. Theoretically, we prove that SWIFT matches the gold-standard iteration convergence rate $\mathcal{O}(1/\sqrt{T})$ of parallel stochastic gradient descent for convex and non-convex smooth optimization (total iterations $T$). Furthermore, we provide theoretical results for IID and non-IID settings without any bounded-delay assumption for slow clients, an assumption that other asynchronous decentralized FL algorithms require. Although SWIFT achieves the same iteration convergence rate with respect to $T$ as other state-of-the-art (SOTA) parallel stochastic algorithms, it converges faster with respect to run-time due to its wait-free structure. Our experimental results demonstrate that SWIFT's run-time is reduced due to a large reduction in communication time per epoch, which falls by an order of magnitude compared to synchronous counterparts. Furthermore, SWIFT produces loss levels for image classification, over IID and non-IID data settings, upwards of 50% faster than existing SOTA algorithms.
Submitted 25 October, 2022;
originally announced October 2022.
-
Practical and Fast Momentum-Based Power Methods
Authors:
Tahseen Rabbani,
Apollo Jain,
Arjun Rajkumar,
Furong Huang
Abstract:
The power method is a classical algorithm with broad applications in machine learning tasks, including streaming PCA, spectral clustering, and low-rank matrix approximation. The distilled purpose of the vanilla power method is to determine the largest eigenvalue (in absolute modulus) and its eigenvector of a matrix. A momentum-based scheme can be used to accelerate the power method, but achieving an optimal convergence rate with existing algorithms critically relies on additional spectral information that is unavailable at run-time, and sub-optimal initializations can result in divergence. In this paper, we provide a pair of novel momentum-based power methods, which we call the delayed momentum power method (DMPower) and a streaming variant, the delayed momentum streaming method (DMStream). Our methods leverage inexact deflation and are capable of achieving near-optimal convergence with far less restrictive hyperparameter requirements. We provide convergence analyses for both algorithms through the lens of perturbation theory. Further, we experimentally demonstrate that DMPower routinely outperforms the vanilla power method and that both algorithms match the convergence speed of an oracle running existing accelerated methods with perfect spectral knowledge.
Submitted 20 August, 2021;
originally announced August 2021.
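As a point of reference for the abstract above, the following is a minimal sketch of the vanilla power method only: it estimates the eigenvalue of largest absolute modulus and its eigenvector. It is not DMPower or DMStream, whose details the abstract does not give. The function name `power_method`, the fixed iteration count, and the uniform starting vector are illustrative assumptions; the sketch also assumes the starting vector is not orthogonal to the dominant eigenvector.

```python
import math


def power_method(matrix, num_iters=200):
    """Vanilla power iteration on a square matrix given as a list of rows.

    Returns (eigenvalue_estimate, unit_eigenvector_estimate).
    Assumption: the dominant eigenvalue is real and strictly largest in modulus.
    """
    n = len(matrix)

    def matvec(vec):
        # Plain dense matrix-vector product.
        return [sum(row[j] * vec[j] for j in range(n)) for row in matrix]

    # Uniform unit-norm start (an arbitrary illustrative choice).
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(num_iters):
        w = matvec(v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("iterate collapsed to zero; choose another start")
        v = [x / norm for x in w]

    # Rayleigh quotient v^T A v for the unit vector v.
    av = matvec(v)
    eigenvalue = sum(v[i] * av[i] for i in range(n))
    return eigenvalue, v


eigenvalue, eigenvector = power_method([[2.0, 0.0], [0.0, 1.0]])
```

The convergence speed of this baseline depends on the ratio between the two largest eigenvalue moduli, which is the quantity that momentum-based variants aim to improve on.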
-
Constructions of difference sets in nonabelian 2-groups
Authors:
Taylor Applebaum,
John Clikeman,
James A. Davis,
John F. Dillon,
Jonathan Jedwab,
Tahseen Rabbani,
Ken Smith,
William Yolland
Abstract:
Difference sets have been studied for more than 80 years. Techniques from algebraic number theory, group theory, finite geometry, and digital communications engineering have been used to establish constructive and nonexistence results. We provide a new theoretical approach which dramatically expands the class of $2$-groups known to contain a difference set, by refining the concept of covering extended building sets introduced by Davis and Jedwab in 1997. We then describe how product constructions and other methods can be used to construct difference sets in some of the remaining $2$-groups. We announce the completion of ten years of collaborative work to determine precisely which of the 56,092 nonisomorphic groups of order 256 contain a difference set. All groups of order 256 not excluded by the two classical nonexistence criteria are found to contain a difference set, in agreement with previous findings for groups of order 4, 16, and 64. We provide suggestions for how the existence question for difference sets in $2$-groups of all orders might be resolved.
Submitted 13 January, 2022; v1 submitted 2 April, 2020;
originally announced April 2020.
-
Safe Feature Elimination for the LASSO and Sparse Supervised Learning Problems
Authors:
Laurent El Ghaoui,
Vivian Viallon,
Tarek Rabbani
Abstract:
We describe a fast method to eliminate features (variables) in $\ell_1$-penalized least-squares regression (or LASSO) problems. The elimination of features leads to a potentially substantial reduction in running time, especially for large values of the penalty parameter. Our method is not heuristic: it only eliminates features that are guaranteed to be absent after solving the LASSO problem. The feature elimination step is easy to parallelize and can test each feature for elimination independently. Moreover, the computational effort of our method is negligible compared to that of solving the LASSO problem: it is roughly the same as a single gradient step. Our method extends the scope of existing LASSO algorithms to treat larger data sets, previously out of their reach. We show how our method can be extended to general $\ell_1$-penalized convex problems and present preliminary results for the Sparse Support Vector Machine and Logistic Regression problems.
Submitted 18 May, 2011; v1 submitted 21 September, 2010;
originally announced September 2010.
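The abstract above does not state the elimination test itself. As a hedged illustration, the sketch below applies the basic SAFE test as it is commonly stated for the problem $\min_w \tfrac{1}{2}\|Xw - y\|_2^2 + \lambda\|w\|_1$: feature $k$ is discarded when $|x_k^\top y| < \lambda - \|x_k\|_2 \|y\|_2 (\lambda_{\max} - \lambda)/\lambda_{\max}$, with $\lambda_{\max} = \max_j |x_j^\top y|$. The function name `safe_lasso_screen`, the column-list data layout, and the problem scaling are assumptions made for this example, not details taken from the listing.

```python
import math


def safe_lasso_screen(columns, y, lam):
    """Return indices of features the basic SAFE test certifies as inactive.

    columns: list of feature vectors, each the same length as y (assumed layout).
    y: response vector.
    lam: penalty parameter of  min_w 0.5*||X w - y||^2 + lam*||w||_1.
    """
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    norm_y = math.sqrt(dot(y, y))
    correlations = [abs(dot(x, y)) for x in columns]
    lam_max = max(correlations)

    # At or above lam_max the all-zero vector solves the problem,
    # so every feature can be dropped.
    if lam_max == 0.0 or lam >= lam_max:
        return list(range(len(columns)))

    eliminated = []
    for k, x in enumerate(columns):
        norm_x = math.sqrt(dot(x, x))
        threshold = lam - norm_x * norm_y * (lam_max - lam) / lam_max
        # Each feature is tested independently of the others.
        if correlations[k] < threshold:
            eliminated.append(k)
    return eliminated


features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.8, 0.6, 0.0]]
response = [1.0, 0.0, 0.0]
dropped = safe_lasso_screen(features, response, lam=0.8)  # -> [1]
```

Each test costs one inner product and one norm per feature, in line with the abstract's remark that the screening effort is comparable to a single gradient step. In this toy example the test discards nothing at `lam=0.5`, matching the abstract's point that the gain is largest for large penalty values.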
-
Safe Feature Elimination in Sparse Supervised Learning
Authors:
Laurent El Ghaoui,
Vivian Viallon,
Tarek Rabbani
Abstract:
We investigate fast methods that quickly eliminate variables (features) in supervised learning problems involving a convex loss function and an $\ell_1$-norm penalty, leading to a potentially substantial reduction in the number of variables prior to running the supervised learning algorithm. The methods are not heuristic: they only eliminate features that are guaranteed to be absent after solving the learning problem. Our framework applies to a large class of problems, including support vector machine classification, logistic regression and least-squares.
The complexity of the feature elimination step is negligible compared to the typical computational effort involved in the sparse supervised learning problem: it grows linearly with the number of features times the number of examples, with a much lower count if the data are sparse. We apply our method to data sets arising in text classification and observe a dramatic reduction of the dimensionality, and hence of the computational effort required to solve the learning problem, especially when very sparse classifiers are sought. Our method immediately extends the scope of existing algorithms, allowing us to run them on data sets of sizes that were out of their reach before.
Submitted 26 October, 2010; v1 submitted 17 September, 2010;
originally announced September 2010.