Federated Minimax Optimization: Improved Convergence Analyses and Algorithms

Authors: Pranay Sharma, Rohan Panda, Gauri Joshi, Pramod K. Varshney

Abstract: In this paper, we consider nonconvex minimax optimization, which is gaining prominence in many modern machine learning applications such as GANs. Large-scale edge-based collection of training data in these applications calls for communication-efficient distributed optimization algorithms, such as those used in federated learning, to process the data. We analyze Local stochastic gradient descent ascent (SGDA), the local-update version of the SGDA algorithm. SGDA is the core algorithm used in minimax optimization, but it is not well understood in a distributed setting. We prove that Local SGDA has order-optimal sample complexity for several classes of nonconvex-concave and nonconvex-nonconcave minimax problems, and also enjoys linear speedup with respect to the number of clients. We provide a novel and tighter analysis, which improves the convergence and communication guarantees in the existing literature. For nonconvex-PL and nonconvex-one-point-concave functions, we improve the existing complexity results for centralized minimax problems. Furthermore, we propose a momentum-based local-update algorithm, which has the same convergence guarantees, but outperforms Local SGDA as demonstrated in our experiments.

Submitted 9 March, 2022; originally announced March 2022.

Comments: 52 pages, 4 figures
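
The local-update structure described in the abstract above (clients run several descent steps on the min variable and ascent steps on the max variable, then the server averages) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' algorithm or experimental setup: the per-client objective f_i(x, y) = 0.5*(x - c_i)^2 + x*y - 0.5*y^2 is a hypothetical strongly-convex-strongly-concave example (the paper treats harder nonconvex classes), and the names `local_sgda`, `centers`, `local_steps`, and `noise_std` are invented for this sketch.

```python
import random


def local_sgda(centers, rounds=50, local_steps=5, lr=0.1, noise_std=0.0, seed=0):
    """Toy Local SGDA on f_i(x, y) = 0.5*(x - c_i)**2 + x*y - 0.5*y**2.

    Each client i holds a hypothetical constant c_i. The averaged objective
    has its saddle point at x = y = mean(c_i) / 2.
    """
    rng = random.Random(seed)
    x_avg, y_avg = 0.0, 0.0
    for _ in range(rounds):
        xs, ys = [], []
        for c in centers:
            # Every client starts the round from the current server iterate.
            x, y = x_avg, y_avg
            for _ in range(local_steps):
                # Stochastic gradients: exact gradient plus Gaussian noise.
                gx = (x - c) + y + rng.gauss(0.0, noise_std)
                gy = x - y + rng.gauss(0.0, noise_std)
                # Simultaneous update: descent in x, ascent in y.
                x, y = x - lr * gx, y + lr * gy
            xs.append(x)
            ys.append(y)
        # Communication step: the server averages the local iterates.
        x_avg = sum(xs) / len(xs)
        y_avg = sum(ys) / len(ys)
    return x_avg, y_avg


if __name__ == "__main__":
    print(local_sgda([1.0, 3.0]))
```

With `noise_std=0.0` and clients `[1.0, 3.0]`, the averaged iterate approaches the saddle point x = y = 1. Because every client in this toy shares the same curvature, averaging removes client drift exactly; that property is specific to the example and does not hold for general heterogeneous problems.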

arXiv:2106.10435 [pdf, other]

STEM: A Stochastic Two-Sided Momentum Algorithm Achieving Near-Optimal Sample and Communication Complexities for Federated Learning

Authors: Prashant Khanduri, Pranay Sharma, Haibo Yang, Mingyi Hong, Jia Liu, Ketan Rajawat, Pramod K. Varshney

Abstract: Federated Learning (FL) refers to the paradigm where multiple worker nodes (WNs) build a joint model by using local data. Despite extensive research, for a generic non-convex FL problem, it is not clear how to choose the WNs' and the server's update directions, the minibatch sizes, and the local update frequency, so that the WNs use the minimum number of samples and communication rounds to achieve the desired solution. This work addresses the above question and considers a class of stochastic algorithms where the WNs perform a few local updates before communication. We show that when both the WN's and the server's directions are chosen based on a stochastic momentum estimator, the algorithm requires $\tilde{\mathcal{O}}(ε^{-3/2})$ samples and $\tilde{\mathcal{O}}(ε^{-1})$ communication rounds to compute an $ε$-stationary solution. To the best of our knowledge, this is the first FL algorithm that achieves such near-optimal sample and communication complexities simultaneously. Further, we show that there is a trade-off curve between local update frequencies and local minibatch sizes, on which the above sample and communication complexities can be maintained. Finally, we show that for the classical FedAvg (a.k.a. Local SGD, which is a momentum-less special case of the STEM), a similar trade-off curve exists, albeit with worse sample and communication complexities. Our insights on this trade-off provide guidelines for choosing the four important design elements for FL algorithms: the update frequency, directions, and minibatch sizes to achieve the best performance.

Submitted 19 June, 2021; originally announced June 2021.

arXiv:2012.11518 [pdf, other]

Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box Optimization Framework

Authors: Pranay Sharma, Kaidi Xu, Sijia Liu, Pin-Yu Chen, Xue Lin, Pramod K. Varshney

Abstract: In this work, we focus on the study of stochastic zeroth-order (ZO) optimization which does not require first-order gradient information and uses only function evaluations. The problem of ZO optimization has emerged in many recent machine learning applications, where the gradient of the objective function is either unavailable or difficult to compute. In such cases, we can approximate the full gradients or stochastic gradients through function value based gradient estimates. Here, we propose a novel hybrid gradient estimator (HGE), which takes advantage of the query-efficiency of random gradient estimates as well as the variance-reduction of coordinate-wise gradient estimates. We show that with a graceful design in coordinate importance sampling, the proposed HGE-based ZO optimization method is efficient both in terms of iteration complexity as well as function query cost. We provide a thorough theoretical analysis of the convergence of our proposed method for non-convex, convex, and strongly-convex optimization. We show that the convergence rate that we derive generalizes the results for some prominent existing methods in the nonconvex case, and matches the optimal result in the convex case. We also corroborate the theory with a real-world black-box attack generation application to demonstrate the empirical advantage of our method over state-of-the-art ZO optimization approaches.

Submitted 21 December, 2020; originally announced December 2020.

Comments: 27 pages, 3 figures
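
The two families of function-value-based gradient estimates contrasted in the abstract above can be illustrated with their commonly used textbook forms. This is a hedged sketch, not the paper's hybrid estimator (HGE): the coordinate-wise estimate below uses central differences along each axis (2d function queries, low variance), and the random estimate uses a single direction drawn uniformly from the unit sphere (2 queries, high variance). The function names, the smoothing parameter `mu`, and the test objective `sum_of_squares` are assumptions made for illustration.

```python
import math
import random


def coordinate_estimate(f, x, mu=1e-3):
    """Coordinate-wise estimate: central difference along every axis (2*d queries)."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += mu
        minus[i] -= mu
        grad.append((f(plus) - f(minus)) / (2.0 * mu))
    return grad


def random_estimate(f, x, rng, mu=1e-3):
    """Random-direction estimate: one forward difference along a unit vector (2 queries)."""
    d = len(x)
    # Normalizing a standard Gaussian vector gives a uniform direction on the sphere.
    u = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in u))
    u = [v / norm for v in u]
    shifted = [xi + mu * ui for xi, ui in zip(x, u)]
    # The factor d compensates for probing only one direction per estimate.
    scale = d * (f(shifted) - f(x)) / mu
    return [scale * ui for ui in u]


def sum_of_squares(x):
    """Hypothetical black-box objective; its true gradient is 2*x."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    point = [1.0, -2.0, 3.0]
    print(coordinate_estimate(sum_of_squares, point))
    print(random_estimate(sum_of_squares, point, random.Random(0)))
```

A single random estimate is noisy, but averaging many of them approaches the true gradient, which is the query-cost versus variance trade-off the abstract refers to.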

arXiv:2005.00224 [pdf, ps, other]

Distributed Stochastic Non-Convex Optimization: Momentum-Based Variance Reduction

Authors: Prashant Khanduri, Pranay Sharma, Swatantra Kafle, Saikiran Bulusu, Ketan Rajawat, Pramod K. Varshney

Abstract: In this work, we propose a distributed algorithm for stochastic non-convex optimization. We consider a worker-server architecture where a set of $K$ worker nodes (WNs) in collaboration with a server node (SN) jointly aim to minimize a global, potentially non-convex objective function. The objective function is assumed to be the sum of local objective functions available at each WN, with each node having access to only the stochastic samples of its local objective function. In contrast to the existing approaches, we employ a momentum based "single loop" distributed algorithm which eliminates the need of computing large batch size gradients to achieve variance reduction. We propose two algorithms, one with "adaptive" and the other with "non-adaptive" learning rates. We show that the proposed algorithms achieve the optimal computational complexity while attaining linear speedup with the number of WNs. Specifically, the algorithms reach an $ε$-stationary point $x_a$ with $\mathbb{E}\| \nabla f(x_a) \| \leq \tilde{O}(K^{-1/3}T^{-1/2} + K^{-1/3}T^{-1/3})$ in $T$ iterations, thereby requiring $\tilde{O}(K^{-1} ε^{-3})$ gradient computations at each WN. Moreover, our approach does not assume identical data distributions across WNs, making the approach general enough for federated learning applications.

Submitted 1 May, 2020; originally announced May 2020.

arXiv:2001.03166 [pdf, ps, other]

On Distributed Online Convex Optimization with Sublinear Dynamic Regret and Fit

Authors: Pranay Sharma, Prashant Khanduri, Lixin Shen, Donald J. Bucci Jr., Pramod K. Varshney

Abstract: In this work, we consider a distributed online convex optimization problem, with time-varying (potentially adversarial) constraints. A set of nodes jointly aims to minimize a global objective function, which is the sum of local convex functions. The objective and constraint functions are revealed locally to the nodes, at each time, after taking an action. Naturally, the constraints cannot be instantaneously satisfied. Therefore, we reformulate the problem to satisfy these constraints in the long term. To this end, we propose a distributed primal-dual mirror descent based approach, in which the primal and dual updates are carried out locally at all the nodes. This is followed by sharing and mixing of the primal variables by the local nodes via communication with the immediate neighbors. To quantify the performance of the proposed algorithm, we utilize the challenging, but more realistic metrics of dynamic regret and fit. Dynamic regret measures the cumulative loss incurred by the algorithm, compared to the best dynamic strategy. On the other hand, fit measures the long term cumulative constraint violations. Without assuming the restrictive Slater's conditions, we show that the proposed algorithm achieves sublinear regret and fit under mild, commonly used assumptions.

Submitted 5 May, 2021; v1 submitted 9 January, 2020; originally announced January 2020.

Comments: 22 pages

arXiv:1912.06036 [pdf, ps, other]

Parallel Restarted SPIDER -- Communication Efficient Distributed Nonconvex Optimization with Optimal Computation Complexity

Authors: Pranay Sharma, Swatantra Kafle, Prashant Khanduri, Saikiran Bulusu, Ketan Rajawat, Pramod K. Varshney

Abstract: In this paper, we propose a distributed algorithm for stochastic smooth, non-convex optimization. We assume a worker-server architecture where $N$ nodes, each having $n$ (potentially infinite) number of samples, collaborate with the help of a central server to perform the optimization task. The global objective is to minimize the average of local cost functions available at individual nodes. The proposed approach is a non-trivial extension of the popular parallel-restarted SGD algorithm, incorporating the optimal variance-reduction based SPIDER gradient estimator into it. We prove convergence of our algorithm to a first-order stationary solution. The proposed approach achieves the best known communication complexity $O(ε^{-1})$ along with the optimal computation complexity. For finite-sum problems (finite $n$), we achieve the optimal computation (IFO) complexity $O(\sqrt{Nn}ε^{-1})$. For online problems ($n$ unknown or infinite), we achieve the optimal IFO complexity $O(ε^{-3/2})$. In both cases, we maintain the linear speedup achieved by existing methods. This is a massive improvement over the $O(ε^{-2})$ IFO complexity of the existing approaches. Additionally, our algorithm is general enough to allow non-identical distributions of data across workers, as in the recently proposed federated learning paradigm.

Submitted 6 November, 2020; v1 submitted 12 December, 2019; originally announced December 2019.

arXiv:1912.04531 [pdf, ps, other]

Byzantine Resilient Non-Convex SVRG with Distributed Batch Gradient Computations

Authors: Prashant Khanduri, Saikiran Bulusu, Pranay Sharma, Pramod K. Varshney

Abstract: In this work, we consider the distributed stochastic optimization problem of minimizing a non-convex function $f(x) = \mathbb{E}_{ξ\sim \mathcal{D}} f(x; ξ)$ in an adversarial setting, where the individual functions $f(x; ξ)$ can also be potentially non-convex. We assume that at most $α$-fraction of a total of $K$ nodes can be Byzantines. We propose a robust stochastic variance-reduced gradient (SVRG) like algorithm for the problem, where the batch gradients are computed at the worker nodes (WNs) and the stochastic gradients are computed at the server node (SN). For the non-convex optimization problem, we show that we need $\tilde{O}\left( \frac{1}{ε^{5/3} K^{2/3}} + \frac{α^{4/3}}{ε^{5/3}} \right)$ gradient computations on average at each node (SN and WNs) to reach an $ε$-stationary point. The proposed algorithm guarantees convergence via the design of a novel Byzantine filtering rule which is independent of the problem dimension. Importantly, we capture the effect of the fraction of Byzantine nodes $α$ present in the network on the convergence performance of the algorithm.

Submitted 10 December, 2019; originally announced December 2019.

Comments: Optimization for Machine Learning, 2019

arXiv:1410.5904 [pdf, ps, other]

Distributed Detection in Tree Networks: Byzantines and Mitigation Techniques

Authors: Bhavya Kailkhura, Swastik Brahma, Berkan Dulek, Yunghsiang S Han, Pramod K. Varshney

Abstract: In this paper, the problem of distributed detection in tree networks in the presence of Byzantines is considered. Closed form expressions for optimal attacking strategies that minimize the miss detection error exponent at the fusion center (FC) are obtained. We also look at the problem from the network designer's (FC's) perspective. We study the problem of designing optimal distributed detection parameters in a tree network in the presence of Byzantines. Next, we model the strategic interaction between the FC and the attacker as a Leader-Follower (Stackelberg) game. This formulation provides a methodology for predicting attacker and defender (FC) equilibrium strategies, which can be used to implement the optimal detector. Finally, a reputation based scheme to identify Byzantines is proposed and its performance is analytically evaluated. We also provide some numerical examples to gain insights into the solution.

Submitted 21 October, 2014; originally announced October 2014.

arXiv:1311.2448 [pdf, ps, other]

Recovery of Sparse Matrices via Matrix Sketching

Authors: Thakshila Wimalajeewa, Yonina C. Eldar, Pramod K. Varshney

Abstract: In this paper, we consider the problem of recovering an unknown sparse matrix X from the matrix sketch Y = AXB^T. The dimension of Y is less than that of X, and A and B are known matrices. This problem can be solved using standard compressive sensing (CS) theory after converting it to vector form using the Kronecker operation. In this case, the measurement matrix assumes a Kronecker product structure. However, as the matrix dimension increases the associated computational complexity makes its use prohibitive. We extend two algorithms, fast iterative shrinkage threshold algorithm (FISTA) and orthogonal matching pursuit (OMP) to solve this problem in matrix form without employing the Kronecker product. While both FISTA and OMP with matrix inputs are shown to be equivalent in performance to their vector counterparts with the Kronecker product, solving them in matrix form is shown to be computationally more efficient. We show that the computational gain achieved by FISTA with matrix inputs over its vector form is more significant compared to that achieved by OMP.

Submitted 11 November, 2013; originally announced November 2013.
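
The vector form mentioned in the abstract above rests on the standard identity vec(A X B^T) = (B ⊗ A) vec(X), where vec stacks columns and ⊗ is the Kronecker product. The sketch below checks that identity on a small integer example using plain Python lists; the helper names and the example matrices are assumptions chosen for illustration and are not taken from the paper.

```python
def matmul(P, Q):
    """Dense matrix product of two list-of-rows matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def transpose(M):
    return [list(row) for row in zip(*M)]


def vec(M):
    """Column-major stacking of a matrix into a flat list."""
    return [M[i][j] for j in range(len(M[0])) for i in range(len(M))]


def kron(P, Q):
    """Kronecker product: block (i, j) of the result is P[i][j] * Q."""
    return [[P[i][j] * Q[k][l]
             for j in range(len(P[0])) for l in range(len(Q[0]))]
            for i in range(len(P)) for k in range(len(Q))]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sketch_both_ways(A, X, B):
    """Return vec(A X B^T) and (B kron A) vec(X); the two should agree."""
    direct = vec(matmul(matmul(A, X), transpose(B)))
    via_kron = matvec(kron(B, A), vec(X))
    return direct, via_kron


if __name__ == "__main__":
    A = [[1, 2, 0], [0, 1, 3]]                      # 2 x 3
    X = [[0, 0, 5, 0], [0, 0, 0, 0], [0, -2, 0, 0]]  # sparse 3 x 4
    B = [[1, 0, 1, 0], [0, 1, 0, 2]]                # 2 x 4
    print(sketch_both_ways(A, X, B))
```

In this example the Kronecker matrix has 4 x 12 entries while A and B together have only 14, which hints at why working directly in matrix form becomes cheaper as the dimensions grow.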

arXiv:1309.4513 [pdf, ps, other]

doi 10.1109/TSP.2014.2321735

Distributed Detection in Tree Topologies with Byzantines

Authors: Bhavya Kailkhura, Swastik Brahma, Yunghsiang S. Han, Pramod K. Varshney

Abstract: In this paper, we consider the problem of distributed detection in tree topologies in the presence of Byzantines. The expression for minimum attacking power required by the Byzantines to blind the fusion center (FC) is obtained. More specifically, we show that when more than a certain fraction of individual node decisions are falsified, the decision fusion scheme becomes completely incapable. We obtain closed form expressions for the optimal attacking strategies that minimize the detection error exponent at the FC. We also look at the possible counter-measures from the FC's perspective to protect the network from these Byzantines. We formulate the robust topology design problem as a bi-level program and provide an efficient algorithm to solve it. We also provide some numerical results to gain insights into the solution.

Submitted 17 September, 2013; originally announced September 2013.

arXiv:1302.1616 [pdf, ps, other]

doi 10.1109/TSP.2013.2289881

Sensor Selection Based on Generalized Information Gain for Target Tracking in Large Sensor Networks

Authors: Xiaojing Shen, Pramod K. Varshney

Abstract: In this paper, sensor selection problems for target tracking in large sensor networks with linear equality or inequality constraints are considered. First, we derive an equivalent Kalman filter for sensor selection, i.e., generalized information filter. Then, under a regularity condition, we prove that the multistage look-ahead policy that minimizes either the final or the average estimation error covariances of next multiple time steps is equivalent to a myopic sensor selection policy that maximizes the trace of the generalized information gain at each time step. Moreover, when the measurement noises are uncorrelated between sensors, the optimal solution can be obtained analytically for sensor selection when constraints are temporally separable. When constraints are temporally inseparable, sensor selections can be obtained by approximately solving a linear programming problem so that the sensor selection problem for a large sensor network can be dealt with quickly. Although there is no guarantee that the gap between the performance of the chosen subset and the performance bound is always small, numerical examples suggest that the algorithm is near-optimal in many cases. Finally, when the measurement noises are correlated between sensors, the sensor selection problem with temporally inseparable constraints can be relaxed to a Boolean quadratic programming problem which can be efficiently solved by a Gaussian randomization procedure along with solving a semi-definite programming problem. Numerical examples show that the proposed method is much better than the method that ignores dependence of noises.

Submitted 6 February, 2013; originally announced February 2013.

Comments: 38 pages, 14 figures, submitted to Journal

arXiv:1211.6719 [pdf, ps, other]

Cooperative Sparsity Pattern Recovery in Distributed Networks Via Distributed-OMP

Authors: Thakshila Wimalajeewa, Pramod K. Varshney

Abstract: In this paper, we consider the problem of collaboratively estimating the sparsity pattern of a sparse signal with multiple measurement data in distributed networks. We assume that each node makes Compressive Sensing (CS) based measurements via random projections regarding the same sparse signal. We propose a distributed greedy algorithm based on Orthogonal Matching Pursuit (OMP), in which the sparse support is estimated iteratively while fusing indices estimated at distributed nodes. In the proposed distributed framework, each node has to perform fewer iterations of OMP than the sparsity index of the sparse signal. Thus, with each node having a very small number of compressive measurements, a significant performance gain in support recovery is achieved via the proposed collaborative scheme compared to the case where each node estimates the sparsity pattern independently and then fusion is performed to get a global estimate. We further extend the algorithm to estimate the sparsity pattern in a binary hypothesis testing framework, where the algorithm first detects the presence of a sparse signal through collaboration among nodes with fewer iterations of OMP and then increases the number of iterations to estimate the sparsity pattern only if the signal is detected.

Submitted 28 November, 2012; originally announced November 2012.

arXiv:0908.2954 [pdf, other]

Approximation of Average Run Length of Moving Sum Algorithms Using Multivariate Probabilities

Authors: Swarnendu Kar, Kishan G. Mehrotra, Pramod K. Varshney

Abstract: Among the various procedures used to detect potential changes in a stochastic process, the moving sum algorithms are very popular due to their intuitive appeal and good statistical performance. One of the important design parameters of a change detection algorithm is the expected interval between false positives, also known as the average run length (ARL). Computation of the ARL usually involves numerical procedures, but in some cases it can be approximated using a series involving multivariate probabilities. In this paper, we present an analysis of this series approach by providing sufficient conditions for convergence and derive an error bound. Using simulation studies, we show that the series approach is applicable to moving average and filtered derivative algorithms. For moving average algorithms, we compare our results with previously known bounds. We use two special cases to illustrate our observations.

Submitted 20 August, 2009; originally announced August 2009.

Showing 1–13 of 13 results for author: Varshney, P K