Search | arXiv e-print repository

Local Linear Convergence of Infeasible Optimization with Orthogonal Constraints

Authors: Youbang Sun, Shixiang Chen, Alfredo Garcia, Shahin Shahrampour

Abstract: Many classical and modern machine learning algorithms require solving optimization tasks under orthogonality constraints. Solving these tasks with feasible methods requires a gradient descent update followed by a retraction operation on the Stiefel manifold, which can be computationally expensive. Recently, an infeasible retraction-free approach, termed the landing algorithm, was proposed as an efficient alternative. Motivated by the common occurrence of orthogonality constraints in tasks such as principal component analysis and training of deep neural networks, this paper studies the landing algorithm and establishes a novel linear convergence rate for smooth non-convex functions using only a local Riemannian PŁ condition. Numerical experiments demonstrate that the landing algorithm performs on par with the state-of-the-art retraction-based methods with substantially reduced computational overhead.

Submitted 7 December, 2024; originally announced December 2024.

arXiv:2406.01484 [pdf, other]

Online Optimization Perspective on First-Order and Zero-Order Decentralized Nonsmooth Nonconvex Stochastic Optimization

Authors: Emre Sahinoglu, Shahin Shahrampour

Abstract: We investigate the finite-time analysis of finding ($δ,ε$)-stationary points for nonsmooth nonconvex objectives in decentralized stochastic optimization. A set of agents aim at minimizing a global function using only their local information by interacting over a network. We present a novel algorithm, called Multi Epoch Decentralized Online Learning (ME-DOL), for which we establish the sample complexity in various settings. First, using a recently proposed online-to-nonconvex technique, we show that our algorithm recovers the optimal convergence rate of smooth nonconvex objectives. We then extend our analysis to the nonsmooth setting, building on properties of randomized smoothing and Goldstein-subdifferential sets. We establish the sample complexity of $O(δ^{-1}ε^{-3})$, which to the best of our knowledge is the first finite-time guarantee for decentralized nonsmooth nonconvex stochastic optimization in the first-order setting (without weak-convexity), matching its optimal centralized counterpart. We further prove the same rate for the zero-order oracle setting without using variance reduction.

Submitted 3 June, 2024; originally announced June 2024.

Comments: To appear in ICML 2024

arXiv:2405.11590 [pdf, other]

Retraction-Free Decentralized Non-convex Optimization with Orthogonal Constraints

Authors: Youbang Sun, Shixiang Chen, Alfredo Garcia, Shahin Shahrampour

Abstract: In this paper, we investigate decentralized non-convex optimization with orthogonal constraints. Conventional algorithms for this setting require either manifold retractions or other types of projection to ensure feasibility, both of which involve costly linear algebra operations (e.g., SVD or matrix inversion). On the other hand, infeasible methods are able to provide similar performance with higher computational efficiency. Inspired by this, we propose the first decentralized version of the retraction-free landing algorithm, called Decentralized Retraction-Free Gradient Tracking (DRFGT). We theoretically prove that DRFGT enjoys the ergodic convergence rate of $\mathcal{O}(1/K)$, matching the convergence rate of centralized, retraction-based methods. We further establish that under a local Riemannian PŁ condition, DRFGT achieves a much faster linear convergence rate. Numerical experiments demonstrate that DRFGT performs on par with the state-of-the-art retraction-based methods with substantially reduced computational overhead.

Submitted 7 December, 2024; v1 submitted 19 May, 2024; originally announced May 2024.

arXiv:2405.02769 [pdf, other]

Linear Convergence of Independent Natural Policy Gradient in Games with Entropy Regularization

Authors: Youbang Sun, Tao Liu, P. R. Kumar, Shahin Shahrampour

Abstract: This work focuses on the entropy-regularized independent natural policy gradient (NPG) algorithm in multi-agent reinforcement learning. In this work, agents are assumed to have access to an oracle with exact policy evaluation and seek to maximize their respective independent rewards. Each individual's reward is assumed to depend on the actions of all the agents in the multi-agent system, leading to a game between agents. We assume all agents make decisions under a policy with bounded rationality, which is enforced by the introduction of entropy regularization. In practice, a smaller regularization implies the agents are more rational and behave closer to Nash policies. On the other hand, agents with larger regularization act more randomly, which ensures more exploration. We show that, under sufficient entropy regularization, the dynamics of this system converge at a linear rate to the quantal response equilibrium (QRE). Although regularization assumptions prevent the QRE from approximating a Nash equilibrium, our findings apply to a wide range of games, including cooperative, potential, and two-player matrix games. We also provide extensive empirical results on multiple games (including Markov games) as a verification of our theoretical analysis.

Submitted 4 May, 2024; originally announced May 2024.

arXiv:2403.08553 [pdf, other]

Regret Analysis of Policy Optimization over Submanifolds for Linearly Constrained Online LQG

Authors: Ting-Jui Chang, Shahin Shahrampour

Abstract: Recent advancement in online optimization and control has provided novel tools to study online linear quadratic regulator (LQR) problems, where cost matrices are varying adversarially over time. However, the controller parameterization of existing works may not satisfy practical conditions like sparsity due to physical connections. In this work, we study online linear quadratic Gaussian problems with a given linear constraint imposed on the controller. Inspired by the recent work of [1] which proposed, for a linearly constrained policy optimization of an offline LQR, a second order method equipped with a Riemannian metric that emerges naturally in the context of optimal control problems, we propose online optimistic Newton on manifold (OONM) which provides an online controller based on the prediction on the first and second order information of the function sequence. To quantify the proposed algorithm, we leverage the notion of regret defined as the sub-optimality of its cumulative cost to that of a (locally) minimizing controller sequence and provide the regret bound in terms of the path-length of the minimizer sequence. Simulation results are also provided to verify the property of OONM.

Submitted 13 March, 2024; originally announced March 2024.

arXiv:2403.07207 [pdf, other]

Tracking Dynamic Gaussian Density with a Theoretically Optimal Sliding Window Approach

Authors: Yinsong Wang, Yu Ding, Shahin Shahrampour

Abstract: Dynamic density estimation is ubiquitous in many applications, including computer vision and signal processing. One popular method to tackle this problem is the "sliding window" kernel density estimator. There exist various implementations of this method that use heuristically defined weight sequences for the observed data. The weight sequence, however, is a key aspect of the estimator affecting the tracking performance significantly. In this work, we study the exact mean integrated squared error (MISE) of "sliding window" Gaussian Kernel Density Estimators for evolving Gaussian densities. We provide a principled guide for choosing the optimal weight sequence by theoretically characterizing the exact MISE, which can be formulated as constrained quadratic programming. We present empirical evidence with synthetic datasets to show that our weighting scheme indeed improves the tracking performance compared to heuristic approaches.

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2310.09727 [pdf, other]

Provably Fast Convergence of Independent Natural Policy Gradient for Markov Potential Games

Authors: Youbang Sun, Tao Liu, Ruida Zhou, P. R. Kumar, Shahin Shahrampour

Abstract: This work studies an independent natural policy gradient (NPG) algorithm for the multi-agent reinforcement learning problem in Markov potential games. It is shown that, under mild technical assumptions and the introduction of the suboptimality gap, the independent NPG method with an oracle providing exact policy evaluation asymptotically reaches an $ε$-Nash Equilibrium (NE) within $\mathcal{O}(1/ε)$ iterations. This improves upon the previous best result of $\mathcal{O}(1/ε^2)$ iterations and is of the same order, $\mathcal{O}(1/ε)$, that is achievable for the single-agent case. Empirical results for a synthetic potential game and a congestion game are presented to verify the theoretical bounds.

Submitted 27 October, 2023; v1 submitted 15 October, 2023; originally announced October 2023.

Comments: Will appear in NeurIPS 2023

arXiv:2310.03206 [pdf, other]

Regret Analysis of Distributed Online Control for LTI Systems with Adversarial Disturbances

Authors: Ting-Jui Chang, Shahin Shahrampour

Abstract: This paper addresses the distributed online control problem over a network of linear time-invariant (LTI) systems (with possibly unknown dynamics) in the presence of adversarial perturbations. There exists a global network cost that is characterized by a time-varying convex function, which evolves in an adversarial manner and is sequentially and partially observed by local agents. The goal of each agent is to generate a control sequence that can compete with the best centralized control policy in hindsight, which has access to the global cost. This problem is formulated as a regret minimization. For the case of known dynamics, we propose a fully distributed disturbance feedback controller that guarantees a regret bound of $O(\sqrt{T}\log T)$, where $T$ is the time horizon. For the unknown dynamics case, we design a distributed explore-then-commit approach, where in the exploration phase all agents jointly learn the system dynamics, and in the learning phase our proposed control algorithm is applied using each agent system estimate. We establish a regret bound of $O(T^{2/3} \text{poly}(\log T))$ for this setting.

Submitted 4 October, 2023; originally announced October 2023.

arXiv:2302.12320 [pdf, other]

Dynamic Regret Analysis of Safe Distributed Online Optimization for Convex and Non-convex Problems

Authors: Ting-Jui Chang, Sapana Chaudhary, Dileep Kalathil, Shahin Shahrampour

Abstract: This paper addresses safe distributed online optimization over an unknown set of linear safety constraints. A network of agents aims at jointly minimizing a global, time-varying function, which is only partially observable to each individual agent. Therefore, agents must engage in local communications to generate a safe sequence of actions competitive with the best minimizer sequence in hindsight, and the gap between the two sequences is quantified via dynamic regret. We propose distributed safe online gradient descent (D-Safe-OGD) with an exploration phase, where all agents estimate the constraint parameters collaboratively to build estimated feasible sets, ensuring the action selection safety during the optimization phase. We prove that for convex functions, D-Safe-OGD achieves a dynamic regret bound of $O(T^{2/3} \sqrt{\log T} + T^{1/3}C_T^*)$, where $C_T^*$ denotes the path-length of the best minimizer sequence. We further prove a dynamic regret bound of $O(T^{2/3} \sqrt{\log T} + T^{2/3}C_T^*)$ for certain non-convex problems, which establishes the first dynamic regret bound for a safe distributed algorithm in the non-convex setting.

Submitted 23 February, 2023; originally announced February 2023.

arXiv:2302.02224 [pdf, other]

TAP: The Attention Patch for Cross-Modal Knowledge Transfer from Unlabeled Modality

Authors: Yinsong Wang, Shahin Shahrampour

Abstract: This paper addresses a cross-modal learning framework, where the objective is to enhance the performance of supervised learning in the primary modality using an unlabeled, unpaired secondary modality. Taking a probabilistic approach for missing information estimation, we show that the extra information contained in the secondary modality can be estimated via Nadaraya-Watson (NW) kernel regression, which can further be expressed as a kernelized cross-attention module (under linear transformation). This expression lays the foundation for introducing The Attention Patch (TAP), a simple neural network add-on that can be trained to allow data-level knowledge transfer from the unlabeled modality. We provide extensive numerical simulations using real-world datasets to show that TAP can provide statistically significant improvement in generalization across different domains and different neural network architectures, making use of seemingly unusable unlabeled cross-modal data.

Submitted 19 June, 2024; v1 submitted 4 February, 2023; originally announced February 2023.

Comments: Accepted to TMLR

arXiv:2209.12307 [pdf, other]

On the Stability Analysis of Open Federated Learning Systems

Authors: Youbang Sun, Heshan Fernando, Tianyi Chen, Shahin Shahrampour

Abstract: We consider open federated learning (FL) systems, where clients may join and/or leave the system during the FL process. Given the variability of the number of present clients, convergence to a fixed model cannot be guaranteed in open systems. Instead, we resort to a new performance metric that we term the stability of open FL systems, which quantifies the magnitude of the learned model in open systems. Under the assumption that local clients' functions are strongly convex and smooth, we theoretically quantify the radius of stability for two FL algorithms, namely local SGD and local Adam. We observe that this radius relies on several key parameters, including the function condition number as well as the variance of the stochastic gradient. Our theoretical results are further verified by numerical simulations on both synthetic and real-world benchmark data-sets.

Submitted 12 March, 2023; v1 submitted 25 September, 2022; originally announced September 2022.

arXiv:2207.01062 [pdf, other]

Distributed Online System Identification for LTI Systems Using Reverse Experience Replay

Authors: Ting-Jui Chang, Shahin Shahrampour

Abstract: Identification of linear time-invariant (LTI) systems plays an important role in control and reinforcement learning. Both asymptotic and finite-time offline system identification are well-studied in the literature. For online system identification, the idea of stochastic-gradient descent with reverse experience replay (SGD-RER) was recently proposed, where the data sequence is stored in several buffers and the stochastic-gradient descent (SGD) update performs backward in each buffer to break the time dependency between data points. Inspired by this work, we study distributed online system identification of LTI systems over a multi-agent network. We consider agents as identical LTI systems, and the network goal is to jointly estimate the system parameters by leveraging the communication between agents. We propose DSGD-RER, a distributed variant of the SGD-RER algorithm, and theoretically characterize the improvement of the estimation error with respect to the network size. Our numerical experiments certify the reduction of estimation error as the network size grows.

Submitted 15 September, 2022; v1 submitted 3 July, 2022; originally announced July 2022.

arXiv:2203.08317 [pdf, other]

TAKDE: Temporal Adaptive Kernel Density Estimator for Real-Time Dynamic Density Estimation

Authors: Yinsong Wang, Yu Ding, Shahin Shahrampour

Abstract: Real-time density estimation is ubiquitous in many applications, including computer vision and signal processing. Kernel density estimation is arguably one of the most commonly used density estimation techniques, and the use of a "sliding window" mechanism adapts kernel density estimators to dynamic processes. In this paper, we derive the asymptotic mean integrated squared error (AMISE) upper bound for the "sliding window" kernel density estimator. This upper bound provides a principled guide to devise a novel estimator, which we name the temporal adaptive kernel density estimator (TAKDE). Compared to heuristic approaches for the "sliding window" kernel density estimator, TAKDE is theoretically optimal in terms of the worst-case AMISE. We provide numerical experiments using synthetic and real-world datasets, showing that TAKDE outperforms other state-of-the-art dynamic density estimators (including those outside of the kernel family). In particular, TAKDE achieves a superior test log-likelihood with a smaller runtime.

Submitted 8 November, 2023; v1 submitted 15 March, 2022; originally announced March 2022.

arXiv:2112.05888 [pdf, other]

A Sparse Expansion For Deep Gaussian Processes

Authors: Liang Ding, Rui Tuo, Shahin Shahrampour

Abstract: In this work, we use Deep Gaussian Processes (DGPs) as statistical surrogates for stochastic processes with complex distributions. Conventional inferential methods for DGP models can suffer from high computational complexity as they require large-scale operations with kernel matrices for training and inference. In this work, we propose an efficient scheme for accurate inference and efficient training based on a range of Gaussian Processes, called the Tensor Markov Gaussian Processes (TMGP). We construct an induced approximation of TMGP referred to as the hierarchical expansion. Next, we develop a deep TMGP (DTMGP) model as the composition of multiple hierarchical expansions of TMGPs. The proposed DTMGP model has the following properties: (1) the outputs of each activation function are deterministic while the weights are chosen independently from the standard Gaussian distribution; (2) in training or prediction, only polylog(M) (out of M) activation functions have non-zero outputs, which significantly boosts the computational efficiency. Our numerical experiments on synthetic models and real datasets show the superior computational efficiency of DTMGP over existing DGP models.

Submitted 29 April, 2023; v1 submitted 10 December, 2021; originally announced December 2021.

arXiv:2105.14385 [pdf, other]

On Centralized and Distributed Mirror Descent: Convergence Analysis Using Quadratic Constraints

Authors: Youbang Sun, Mahyar Fazlyab, Shahin Shahrampour

Abstract: Mirror descent (MD) is a powerful first-order optimization technique that subsumes several optimization algorithms including gradient descent (GD). In this work, we develop a semi-definite programming (SDP) framework to analyze the convergence rate of MD in centralized and distributed settings under both strongly convex and non-strongly convex assumptions. We view MD with a dynamical system lens and leverage quadratic constraints (QCs) to provide explicit convergence rates based on Lyapunov stability. For centralized MD under the strongly convex assumption, we develop an SDP that certifies exponential convergence rates. We prove that the SDP always has a feasible solution that recovers the optimal GD rate as a special case. We complement our analysis by providing the $O(1/k)$ convergence rate for convex problems. Next, we analyze the convergence of distributed MD and characterize the rate using SDP. To the best of our knowledge, the numerical rate of distributed MD has not been previously reported in the literature. We further prove an $O(1/k)$ convergence rate for distributed MD in the convex setting. Our numerical experiments on strongly convex problems indicate that our framework certifies superior convergence rates compared to the existing rates for distributed GD.

Submitted 18 January, 2022; v1 submitted 29 May, 2021; originally announced May 2021.

arXiv:2105.07310 [pdf, other]

Regret Analysis of Distributed Online LQR Control for Unknown LTI Systems

Authors: Ting-Jui Chang, Shahin Shahrampour

Abstract: Online optimization has recently opened avenues to study optimal control for time-varying cost functions that are unknown in advance. Inspired by this line of research, we study the distributed online linear quadratic regulator (LQR) problem for linear time-invariant (LTI) systems with unknown dynamics. Consider a multi-agent network where each agent is modeled as an LTI system. The network has a global time-varying quadratic cost, which may evolve adversarially and is only partially observed by each agent sequentially. The goal of the network is to collectively (i) estimate the unknown dynamics and (ii) compute local control sequences competitive to the best centralized policy in hindsight, which minimizes the sum of network costs over time. This problem is formulated as a regret minimization. We propose a distributed variant of the online LQR algorithm, where agents compute their system estimates during an exploration stage. Each agent then applies distributed online gradient descent on a semi-definite programming (SDP) whose feasible set is based on the agent system estimate. We prove that with high probability the regret bound of our proposed algorithm scales as $O(T^{2/3}\log T)$, implying the consensus of all agents over time. We also provide simulation results verifying our theoretical guarantee.

Submitted 6 February, 2022; v1 submitted 15 May, 2021; originally announced May 2021.

Comments: arXiv admin note: substantial text overlap with arXiv:2009.13749

arXiv:2102.07091 [pdf, other]

Decentralized Riemannian Gradient Descent on the Stiefel Manifold

Authors: Shixiang Chen, Alfredo Garcia, Mingyi Hong, Shahin Shahrampour

Abstract: We consider a distributed non-convex optimization where a network of agents aims at minimizing a global function over the Stiefel manifold. The global function is represented as a finite sum of smooth local functions, where each local function is associated with one agent and agents communicate with each other over an undirected connected graph. The problem is non-convex as local functions are possibly non-convex (but smooth) and the Stiefel manifold is a non-convex set. We present a decentralized Riemannian stochastic gradient method (DRSGD) with the convergence rate of $\mathcal{O}(1/\sqrt{K})$ to a stationary point. To have exact convergence with constant stepsize, we also propose a decentralized Riemannian gradient tracking algorithm (DRGTA) with the convergence rate of $\mathcal{O}(1/K)$ to a stationary point. We use multi-step consensus to preserve the iteration in the local (consensus) region. DRGTA is the first decentralized algorithm with exact convergence for distributed optimization on the Stiefel manifold.

Submitted 14 February, 2021; originally announced February 2021.

arXiv:2101.09346 [pdf, ps, other]

On the Local Linear Rate of Consensus on the Stiefel Manifold

Authors: Shixiang Chen, Alfredo Garcia, Mingyi Hong, Shahin Shahrampour

Abstract: We study the convergence properties of the Riemannian gradient method for solving the consensus problem (for an undirected connected graph) over the Stiefel manifold. The Stiefel manifold is a non-convex set and the standard notion of averaging in the Euclidean space does not work for this problem. We propose Distributed Riemannian Consensus on Stiefel Manifold (DRCS) and prove that it enjoys a local linear convergence rate to global consensus. More importantly, this local rate asymptotically scales with the second largest singular value of the communication matrix, which is on par with the well-known rate in the Euclidean space. To the best of our knowledge, this is the first work showing the equality of the two rates. The main technical challenges include (i) developing a Riemannian restricted secant inequality for convergence analysis, and (ii) identifying the conditions (e.g., suitable step-size and initialization) under which the algorithm always stays in the local region.

Submitted 22 January, 2021; originally announced January 2021.

arXiv:2011.12233 [pdf, other]

Linear Convergence of Distributed Mirror Descent with Integral Feedback for Strongly Convex Problems

Authors: Youbang Sun, Shahin Shahrampour

Abstract: Distributed optimization often requires finding the minimum of a global objective function written as a sum of local functions. A group of agents work collectively to minimize the global function. We study a continuous-time decentralized mirror descent algorithm that uses purely local gradient information to converge to the global optimal solution. The algorithm enforces consensus among agents using the idea of integral feedback. Recently, Sun and Shahrampour (2020) studied the asymptotic convergence of this algorithm when the global function is strongly convex but local functions are convex. Using control theory tools, in this work, we prove that the algorithm indeed achieves (local) exponential convergence. We also provide a numerical experiment on a real data-set as a validation of the convergence speed of our algorithm.

Submitted 24 November, 2020; originally announced November 2020.

Comments: 12 pages, 1 figure

arXiv:2009.13749 [pdf, other]

Distributed Online Linear Quadratic Control for Linear Time-invariant Systems

Authors: Ting-Jui Chang, Shahin Shahrampour

Abstract: Classical linear quadratic (LQ) control centers around linear time-invariant (LTI) systems, where the control-state pairs introduce a quadratic cost with time-invariant parameters. Recent advancement in online optimization and control has provided novel tools to study LQ problems that are robust to time-varying cost parameters. Inspired by this line of research, we study the distributed online LQ problem for identical LTI systems. Consider a multi-agent network where each agent is modeled as an LTI system. The LTI systems are associated with decoupled, time-varying quadratic costs that are revealed sequentially. The goal of the network is to make the control sequence of all agents competitive to that of the best centralized policy in hindsight, captured by the notion of regret. We develop a distributed variant of the online LQ algorithm, which runs distributed online gradient descent with a projection to a semi-definite programming (SDP) to generate controllers. We establish a regret bound scaling as the square root of the finite time-horizon, implying that agents reach consensus as time grows. We further provide numerical experiments verifying our theoretical result.

Submitted 28 September, 2020; originally announced September 2020.

arXiv:2009.06747 [pdf, other]

Distributed Mirror Descent with Integral Feedback: Asymptotic Convergence Analysis of Continuous-time Dynamics

Authors: Youbang Sun, Shahin Shahrampour

Abstract: This work addresses distributed optimization, where a network of agents wants to minimize a global strongly convex objective function. The global function can be written as a sum of local convex functions, each of which is associated with an agent. We propose a continuous-time distributed mirror descent algorithm that uses purely local information to converge to the global optimum. Unlike previous work on distributed mirror descent, we incorporate an integral feedback in the update, allowing the algorithm to converge with a constant step-size when discretized. We establish the asymptotic convergence of the algorithm using Lyapunov stability analysis. We further illustrate numerical experiments that verify the advantage of adopting integral feedback for improving the convergence rate of distributed mirror descent.

Submitted 14 September, 2020; originally announced September 2020.

arXiv:2006.03912 [pdf, other]

Unconstrained Online Optimization: Dynamic Regret Analysis of Strongly Convex and Smooth Problems

Authors: Ting-Jui Chang, Shahin Shahrampour

Abstract: The regret bound of dynamic online learning algorithms is often expressed in terms of the variation in the function sequence ($V_T$) and/or the path-length of the minimizer sequence after $T$ rounds. For strongly convex and smooth functions, Zhang et al. establish the squared path-length of the minimizer sequence ($C^*_{2,T}$) as a lower bound on regret. They also show that online gradient descent (OGD) achieves this lower bound using multiple gradient queries per round. In this paper, we focus on unconstrained online optimization. We first show that a preconditioned variant of OGD achieves $O(C^*_{2,T})$ with one gradient query per round. We then propose the online optimistic Newton (OON) method for the case when the first and second order information of the function sequence is predictable. The regret bound of OON is captured via the quartic path-length of the minimizer sequence ($C^*_{4,T}$), which can be much smaller than $C^*_{2,T}$. We finally show that by using multiple gradients for OGD, we can achieve an upper bound of $O(\min\{C^*_{2,T},V_T\})$ on regret.

Submitted 14 August, 2020; v1 submitted 6 June, 2020; originally announced June 2020.

arXiv:2006.03706 [pdf, ps, other]

Learning from Non-Random Data in Hilbert Spaces: An Optimal Recovery Perspective

Authors: Simon Foucart, Chunyang Liao, Shahin Shahrampour, Yinsong Wang

Abstract: The notion of generalization in classical Statistical Learning is often attached to the postulate that data points are independent and identically distributed (IID) random variables. While relevant in many applications, this postulate may not hold in general, encouraging the development of learning frameworks that are robust to non-IID data. In this work, we consider the regression problem from an Optimal Recovery perspective. Relying on a model assumption comparable to choosing a hypothesis class, a learner aims at minimizing the worst-case error, without recourse to any probabilistic assumption on the data. We first develop a semidefinite program for calculating the worst-case error of any recovery map in finite-dimensional Hilbert spaces. Then, for any Hilbert space, we show that Optimal Recovery provides a formula which is user-friendly from an algorithmic point-of-view, as long as the hypothesis class is linear. Interestingly, this formula coincides with kernel ridgeless regression in some cases, proving that minimizing the average error and worst-case error can yield the same solution. We provide numerical experiments in support of our theoretical findings.

Submitted 11 September, 2020; v1 submitted 5 June, 2020; originally announced June 2020.

Comments: Title modified; formatting changed; some reorganization and addition of Theorem 4

arXiv:2006.03696 [pdf, other]

High-Dimensional Non-Parametric Density Estimation in Mixed Smooth Sobolev Spaces

Authors: Liang Ding, Lu Zou, Wenjia Wang, Shahin Shahrampour, Rui Tuo

Abstract: Density estimation plays a key role in many tasks in machine learning, statistical inference, and visualization. The main bottleneck in high-dimensional density estimation is the prohibitive computational cost and the slow convergence rate. In this paper, we propose novel estimators for high-dimensional non-parametric density estimation called the adaptive hyperbolic cross density estimators, which enjoy nice convergence properties in the mixed smooth Sobolev spaces. As modifications of the usual Sobolev spaces, the mixed smooth Sobolev spaces are more suitable for describing high-dimensional density functions in some applications. We prove that, unlike other existing approaches, the proposed estimator does not suffer the curse of dimensionality under Integral Probability Metric, including Hölder Integral Probability Metric, where Total Variation Metric and Wasserstein Distance are special cases. Applications of the proposed estimators to generative adversarial networks (GANs) and goodness of fit test for high-dimensional data are discussed to illustrate the proposed estimator's good performance in high-dimensional problems. Numerical experiments are conducted and illustrate the efficiency of our proposed method.

Submitted 20 October, 2021; v1 submitted 5 June, 2020; originally announced June 2020.

arXiv:2004.13233 [pdf, other]

doi 10.1109/TAC.2021.3056535

On Distributed Non-convex Optimization: Projected Subgradient Method For Weakly Convex Problems in Networks

Authors: Shixiang Chen, Alfredo Garcia, Shahin Shahrampour

Abstract: The stochastic subgradient method is a widely-used algorithm for solving large-scale optimization problems arising in machine learning. Often these problems are neither smooth nor convex. Recently, Davis et al. [1-2] characterized the convergence of the stochastic subgradient method for the weakly convex case, which encompasses many important applications (e.g., robust phase retrieval, blind deconvolution, biconvex compressive sensing, and dictionary learning). In practice, distributed implementations of the projected stochastic subgradient method (stoDPSM) are used to speed-up risk minimization. In this paper, we propose a distributed implementation of the stochastic subgradient method with a theoretical guarantee. Specifically, we show the global convergence of stoDPSM using the Moreau envelope stationarity measure. Furthermore, under a so-called sharpness condition, we show that deterministic DPSM (with a proper initialization) converges linearly to the sharp minima, using geometrically diminishing step-size. We provide numerical experiments to support our theoretical analysis.

Submitted 23 February, 2021; v1 submitted 27 April, 2020; originally announced April 2020.

arXiv:2003.05783 [pdf, other]

Statistical and Topological Properties of Sliced Probability Divergences

Authors: Kimia Nadjahi, Alain Durmus, Lénaïc Chizat, Soheil Kolouri, Shahin Shahrampour, Umut Şimşekli

Abstract: The idea of slicing divergences has been proven to be successful when comparing two probability measures in various machine learning applications including generative modeling, and consists in computing the expected value of a 'base divergence' between one-dimensional random projections of the two measures. However, the topological, statistical, and computational consequences of this technique have not yet been well-established. In this paper, we aim at bridging this gap and derive various theoretical properties of sliced probability divergences. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, implying that the sliced divergence will share similar topological properties. We then precise the results in the case where the base divergence belongs to the class of integral probability metrics. On the other hand, we establish that, under mild conditions, the sample complexity of a sliced divergence does not depend on the problem dimension. We finally apply our general results to several base divergences, and illustrate our theory on both synthetic and real data experiments.

Submitted 4 January, 2022; v1 submitted 12 March, 2020; originally announced March 2020.

Comments: Published at NeurIPS 2020 (Spotlight)

arXiv:2002.12537 [pdf, other]

Generalized Sliced Distances for Probability Distributions

Authors: Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Shahin Shahrampour

Abstract: Probability metrics have become an indispensable part of modern statistics and machine learning, and they play a quintessential role in various applications, including statistical hypothesis testing and generative modeling. However, in a practical setting, the convergence behavior of the algorithms built upon these distances has not been well established, except for a few specific cases. In this paper, we introduce a broad family of probability metrics, coined as Generalized Sliced Probability Metrics (GSPMs), that are deeply rooted in the generalized Radon transform. We first verify that GSPMs are metrics. Then, we identify a subset of GSPMs that are equivalent to maximum mean discrepancy (MMD) with novel positive definite kernels, which come with a unique geometric interpretation. Finally, by exploiting this connection, we consider GSPM-based gradient flows for generative modeling applications and show that under mild assumptions, the gradient flow converges to the global optimum. We illustrate the utility of our approach on both real and synthetic problems.

Submitted 27 February, 2020; originally announced February 2020.

arXiv:2002.04753 [pdf, other]

RFN: A Random-Feature Based Newton Method for Empirical Risk Minimization in Reproducing Kernel Hilbert Spaces

Authors: Ting-Jui Chang, Shahin Shahrampour

Abstract: In supervised learning using kernel methods, we often encounter a large-scale finite-sum minimization over a reproducing kernel Hilbert space (RKHS). Large-scale finite-sum problems can be solved using efficient variants of the Newton method, where the Hessian is approximated via sub-samples of data. In RKHS, however, the dependence of the penalty function on the kernel makes standard sub-sampling approaches inapplicable, since the gram matrix is not readily available in a low-rank form. In this paper, we observe that for this class of problems, one can naturally use kernel approximation to speed up the Newton method. Focusing on randomized features for kernel approximation, we provide a novel second-order algorithm that enjoys local superlinear convergence and global linear convergence (with high probability). We derive the theoretical lower bound for the number of random features required for the approximated Hessian to be close to the true Hessian in the norm sense. Our numerical experiments on real-world data verify the efficiency of our method compared to several benchmarks.

Submitted 6 June, 2022; v1 submitted 11 February, 2020; originally announced February 2020.

arXiv:2002.04195 [pdf, other]

Generalization Guarantees for Sparse Kernel Approximation with Entropic Optimal Features

Authors: Liang Ding, Rui Tuo, Shahin Shahrampour

Abstract: Despite their success, kernel methods suffer from a massive computational cost in practice. In this paper, in lieu of commonly used kernel expansion with respect to $N$ inputs, we develop a novel optimal design maximizing the entropy among kernel features. This procedure results in a kernel expansion with respect to entropic optimal features (EOF), improving the data representation dramatically due to features dissimilarity. Under mild technical assumptions, our generalization bound shows that with only $O(N^{\frac{1}{4}})$ features (disregarding logarithmic factors), we can achieve the optimal statistical accuracy (i.e., $O(1/\sqrt{N})$). The salient feature of our design is its sparsity that significantly reduces the time and space cost. Our numerical experiments on benchmark datasets verify the superiority of EOF over the state-of-the-art in kernel approximation.

Submitted 10 February, 2020; originally announced February 2020.

arXiv:1910.13567 [pdf, other]

Cell Association via Boundary Detection: A Scalable Approach Based on Data-Driven Random Features

Authors: Yinsong Wang, Hessam Mahdavifar, Kamran Entesari, Shahin Shahrampour

Abstract: The problem of cell association is considered for cellular users present in the field. This has become a challenging problem with the deployment of 5G networks which will share the sub-6 GHz bands with the legacy 4G networks. Instead of taking a network-controlled approach, which may not be scalable with the number of users and may introduce extra delays into the system, we propose a scalable solution in the physical layer by utilizing data that can be collected by a large number of spectrum sensors deployed in the field. More specifically, we model the cell association problem as a nonlinear boundary detection problem and focus on solving this problem using randomized shallow networks for determining the boundaries for location of users associated to each cell. We exploit the power of data-driven modeling to reduce the computational cost of training in the proposed solution for the cell association problem. This is equivalent to choosing the right basis functions in the shallow architecture such that the detection is done with minimal error. Our experiments demonstrate the superiority of this method compared to its data-independent counterparts as well as its computational advantage over kernel methods.

Submitted 29 October, 2019; originally announced October 2019.

Comments: 6 pages

arXiv:1910.05384 [pdf, other]

ORCCA: Optimal Randomized Canonical Correlation Analysis

Authors: Yinsong Wang, Shahin Shahrampour

Abstract: The random features approach has been widely used for kernel approximation in large-scale machine learning. A number of recent studies have explored data-dependent sampling of features, modifying the stochastic oracle from which random features are sampled. While proposed techniques in this realm improve the approximation, their suitability is often verified on a single learning task. In this paper, we propose a task-specific scoring rule for selecting random features, which can be employed for different applications with some adjustments. We restrict our attention to Canonical Correlation Analysis (CCA), and we provide a novel, principled guide for finding the score function maximizing the canonical correlations. We prove that this method, called ORCCA, can outperform (in expectation) the corresponding Kernel CCA with a default kernel. Numerical experiments verify that ORCCA is significantly superior to other approximation techniques in the CCA task.

Submitted 1 November, 2021; v1 submitted 11 October, 2019; originally announced October 2019.

arXiv:1909.11820 [pdf, other]

A Mean-Field Theory for Kernel Alignment with Random Features in Generative and Discriminative Models

Authors: Masoud Badiei Khuzani, Liyue Shen, Shahin Shahrampour, Lei Xing

Abstract: We propose a novel supervised learning method to optimize the kernel in the maximum mean discrepancy generative adversarial networks (MMD GANs), and the kernel support vector machines (SVMs). Specifically, we characterize a distributionally robust optimization problem to compute a good distribution for the random feature model of Rahimi and Recht. Due to the fact that the distributional optimization is infinite dimensional, we consider a Monte-Carlo sample average approximation (SAA) to obtain a more tractable finite dimensional optimization problem. We subsequently leverage a particle stochastic gradient descent (SGD) method to solve the derived finite dimensional optimization problem. Based on a mean-field analysis, we then prove that the empirical distribution of the interactive particles system at each iteration of the SGD follows the path of the gradient descent flow on the Wasserstein manifold. We also establish the non-asymptotic consistency of the finite sample estimator. We evaluate our kernel learning method for the hypothesis testing problem by evaluating the kernel MMD statistics, and show that our learning method indeed attains better power of the test for larger threshold values compared to an untrained kernel. Moreover, our empirical evaluation on benchmark data-sets shows the advantage of our kernel learning approach compared to alternative kernel learning methods.

Submitted 21 February, 2020; v1 submitted 25 September, 2019; originally announced September 2019.

Comments: 51 pages, 4 figures. In this edition, new simulations for the kernel SVMs are included

arXiv:1909.09736 [pdf, other]

Distributed Parameter Estimation in Randomized One-hidden-layer Neural Networks

Authors: Yinsong Wang, Shahin Shahrampour

Abstract: This paper addresses distributed parameter estimation in randomized one-hidden-layer neural networks. A group of agents sequentially receive measurements of an unknown parameter that is only partially observable to them. In this paper, we present a fully distributed estimation algorithm where agents exchange local estimates with their neighbors to collectively identify the true value of the parameter. We prove that this distributed update provides an asymptotically unbiased estimator of the unknown parameter, i.e., the first moment of the expected global error converges to zero asymptotically. We further analyze the efficiency of the proposed estimation scheme by establishing an asymptotic upper bound on the variance of the global error. Applying our method to a real-world dataset related to appliances energy prediction, we observe that our empirical findings verify the theoretical results.

Submitted 20 March, 2020; v1 submitted 20 September, 2019; originally announced September 2019.

Comments: 6 Pages

arXiv:1903.08329 [pdf, other]

On Sampling Random Features From Empirical Leverage Scores: Implementation and Theoretical Guarantees

Authors: Shahin Shahrampour, Soheil Kolouri

Abstract: Random features provide a practical framework for large-scale kernel approximation and supervised learning. It has been shown that data-dependent sampling of random features using leverage scores can significantly reduce the number of features required to achieve optimal learning bounds. Leverage scores introduce an optimized distribution for features based on an infinite-dimensional integral operator (depending on input distribution), which is impractical to sample from. Focusing on empirical leverage scores in this paper, we establish an out-of-sample performance bound, revealing an interesting trade-off between the approximated kernel and the eigenvalue decay of another kernel in the domain of random features defined based on data distribution. Our experiments verify that the empirical algorithm consistently outperforms vanilla Monte Carlo sampling, and with a minor modification the method is even competitive to supervised data-dependent kernel learning, without using the output (label) information.

Submitted 19 March, 2019; originally announced March 2019.

Comments: 23 pages

arXiv:1810.03817 [pdf, ps, other]

Learning Bounds for Greedy Approximation with Explicit Feature Maps from Multiple Kernels

Authors: Shahin Shahrampour, Vahid Tarokh

Abstract: Nonlinear kernels can be approximated using finite-dimensional feature maps for efficient risk minimization. Due to the inherent trade-off between the dimension of the (mapped) feature space and the approximation accuracy, the key problem is to identify promising (explicit) features leading to a satisfactory out-of-sample performance. In this work, we tackle this problem by efficiently choosing such features from multiple kernels in a greedy fashion. Our method sequentially selects these explicit features from a set of candidate features using a correlation metric. We establish an out-of-sample error bound capturing the trade-off between the error in terms of explicit features (approximation error) and the error due to spectral properties of the best model in the Hilbert space associated to the combined kernel (spectral error). The result verifies that when the (best) underlying data model is sparse enough, i.e., the spectral error is negligible, one can control the test error with a small number of explicit features, that can scale poly-logarithmically with data. Our empirical results show that given a fixed number of explicit features, the method can achieve a lower test error with a smaller time cost, compared to the state-of-the-art in data-dependent random features.

Submitted 9 October, 2018; originally announced October 2018.

Comments: Proc. of 2018 Advances in Neural Information Processing Systems (NIPS 2018)

arXiv:1712.07102 [pdf, other]

On Data-Dependent Random Features for Improved Generalization in Supervised Learning

Authors: Shahin Shahrampour, Ahmad Beirami, Vahid Tarokh

Abstract: The randomized-feature approach has been successfully employed in large-scale kernel approximation and supervised learning. The distribution from which the random features are drawn impacts the number of features required to efficiently perform a learning task. Recently, it has been shown that employing data-dependent randomization improves the performance in terms of the required number of random features. In this paper, we are concerned with the randomized-feature approach in supervised learning for good generalizability. We propose the Energy-based Exploration of Random Features (EERF) algorithm based on a data-dependent score function that explores the set of possible features and exploits the promising regions. We prove that the proposed score function with high probability recovers the spectrum of the best fit within the model class. Our empirical results on several benchmark datasets further verify that our method requires a smaller number of random features to achieve a certain generalization error compared to the state-of-the-art while introducing negligible pre-processing overhead. EERF can be implemented in a few lines of code and requires no additional tuning parameters.

Submitted 19 December, 2017; originally announced December 2017.

Comments: 12 pages; (pages 1-8) to appear in Proc. of AAAI Conference on Artificial Intelligence (AAAI), 2018

arXiv:1711.05323 [pdf, other]

On Optimal Generalizability in Parametric Learning

Authors: Ahmad Beirami, Meisam Razaviyayn, Shahin Shahrampour, Vahid Tarokh

Abstract: We consider the parametric learning problem, where the objective of the learner is determined by a parametric loss function. Employing empirical risk minimization, possibly with regularization, the inferred parameter vector will be biased toward the training samples. Such bias is measured by the cross validation procedure in practice where the data set is partitioned into a training set used for training and a validation set, which is not used in training and is left to measure the out-of-sample performance. A classical cross validation strategy is the leave-one-out cross validation (LOOCV) where one sample is left out for validation and training is done on the rest of the samples that are presented to the learner, and this process is repeated on all of the samples. LOOCV is rarely used in practice due to the high computational complexity. In this paper, we first develop a computationally efficient approximate LOOCV (ALOOCV) and provide theoretical guarantees for its performance. Then we use ALOOCV to provide an optimization algorithm for finding the regularizer in the empirical risk minimization framework. In our numerical experiments, we illustrate the accuracy and efficiency of ALOOCV as well as our proposed framework for the optimization of the regularizer.

Submitted 14 November, 2017; originally announced November 2017.

Comments: Proc. of 2017 Advances in Neural Information Processing Systems (NIPS 2017)

arXiv:1707.02649 [pdf, ps, other]

Nonlinear Sequential Accepts and Rejects for Identification of Top Arms in Stochastic Bandits

Authors: Shahin Shahrampour, Vahid Tarokh

Abstract: We address the M-best-arm identification problem in multi-armed bandits. A player has a limited budget to explore K arms (M<K), and once pulled, each arm yields a reward drawn (independently) from a fixed, unknown distribution. The goal is to find the top M arms in the sense of expected reward. We develop an algorithm which proceeds in rounds to deactivate arms iteratively. At each round, the budget is divided by a nonlinear function of remaining arms, and the arms are pulled correspondingly. Based on a decision rule, the deactivated arm at each round may be accepted or rejected. The algorithm outputs the accepted arms that should ideally be the top M arms. We characterize the decay rate of the misidentification probability and establish that the nonlinear budget allocation proves to be useful for different problem environments (described by the number of competitive arms). We provide comprehensive numerical experiments showing that our algorithm outperforms the state-of-the-art using suitable nonlinearity.

Submitted 9 July, 2017; originally announced July 2017.

Comments: 7 pages

arXiv:1702.06219 [pdf, other]

An Online Optimization Approach for Multi-Agent Tracking of Dynamic Parameters in the Presence of Adversarial Noise

Authors: Shahin Shahrampour, Ali Jadbabaie

Abstract: This paper addresses tracking of a moving target in a multi-agent network. The target follows a linear dynamics corrupted by an adversarial noise, i.e., the noise is not generated from a statistical distribution. The location of the target at each time induces a global time-varying loss function, and the global loss is a sum of local losses, each of which is associated to one agent. Agents' noisy observations could be nonlinear. We formulate this problem as a distributed online optimization where agents communicate with each other to track the minimizer of the global loss. We then propose a decentralized version of the Mirror Descent algorithm and provide the non-asymptotic analysis of the problem. Using the notion of dynamic regret, we measure the performance of our algorithm versus its offline counterpart in the centralized setting. We prove that the bound on dynamic regret scales inversely in the network spectral gap, and it represents the adversarial noise causing deviation with respect to the linear dynamics. Our result subsumes a number of results in the distributed optimization literature. Finally, in a numerical experiment, we verify that our algorithm can be simply implemented for multi-agent tracking with nonlinear observations.

Submitted 20 February, 2017; originally announced February 2017.

Comments: 8 pages, To appear in American Control Conference 2017

arXiv:1609.02845 [pdf, other]

Distributed Online Optimization in Dynamic Environments Using Mirror Descent

Authors: Shahin Shahrampour, Ali Jadbabaie

Abstract: This work addresses decentralized online optimization in non-stationary environments. A network of agents aim to track the minimizer of a global time-varying convex function. The minimizer evolves according to a known dynamics corrupted by an unknown, unstructured noise. At each time, the global function can be cast as a sum of a finite number of local functions, each of which is assigned to one agent in the network. Moreover, the local functions become available to agents sequentially, and agents do not have a prior knowledge of the future cost functions. Therefore, agents must communicate with each other to build an online approximation of the global function. We propose a decentralized variation of the celebrated Mirror Descent, developed by Nemirovski and Yudin. Using the notion of Bregman divergence in lieu of Euclidean distance for projection, Mirror Descent has been shown to be a powerful tool in large-scale optimization. Our algorithm builds on Mirror Descent, while ensuring that agents perform a consensus step to follow the global function and take into account the dynamics of the global minimizer. To measure the performance of the proposed online algorithm, we compare it to its offline counterpart, where the global functions are available a priori. The gap between the two is called dynamic regret. We establish a regret bound that scales inversely in the spectral gap of the network, and more notably it represents the deviation of minimizer sequence with respect to the given dynamics. We then show that our results subsume a number of results in distributed optimization. We demonstrate the application of our method to decentralized tracking of dynamic parameters and verify the results via numerical experiments.

Submitted 9 September, 2016; originally announced September 2016.

arXiv:1609.02606 [pdf, ps, other]

doi 10.1109/TSP.2017.2706192

On Sequential Elimination Algorithms for Best-Arm Identification in Multi-Armed Bandits

Authors: Shahin Shahrampour, Mohammad Noshad, Vahid Tarokh

Abstract: We consider the best-arm identification problem in multi-armed bandits, which focuses purely on exploration. A player is given a fixed budget to explore a finite set of arms, and the rewards of each arm are drawn independently from a fixed, unknown distribution. The player aims to identify the arm with the largest expected reward. We propose a general framework to unify sequential elimination algorithms, where the arms are dismissed iteratively until a unique arm is left. Our analysis reveals a novel performance measure expressed in terms of the sampling mechanism and number of eliminated arms at each round. Based on this result, we develop an algorithm that divides the budget according to a nonlinear function of remaining arms at each round. We provide theoretical guarantees for the algorithm, characterizing the suitable nonlinearity for different problem environments described by the number of competitive arms. Matching the theoretical results, our experiments show that the nonlinear algorithm outperforms the state-of-the-art. We finally study the side-observation model, where pulling an arm reveals the rewards of its related arms, and we establish improved theoretical guarantees in the pure-exploration setting.

Submitted 13 April, 2017; v1 submitted 8 September, 2016; originally announced September 2016.

arXiv:1603.04954 [pdf, other]

Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems

Authors: Aryan Mokhtari, Shahin Shahrampour, Ali Jadbabaie, Alejandro Ribeiro

Abstract: In this paper, we address tracking of a time-varying parameter with unknown dynamics. We formalize the problem as an instance of online optimization in a dynamic setting. Using online gradient descent, we propose a method that sequentially predicts the value of the parameter and in turn suffers a loss. The objective is to minimize the accumulation of losses over the time horizon, a notion that is termed dynamic regret. While existing methods focus on convex loss functions, we consider strongly convex functions so as to provide better guarantees of performance. We derive a regret bound that captures the path-length of the time-varying parameter, defined in terms of the distance between its consecutive values. In other words, the bound represents the natural connection of tracking quality to the rate of change of the parameter. We provide numerical experiments to complement our theoretical findings.

Submitted 16 March, 2016; originally announced March 2016.

arXiv:1603.00576 [pdf, ps, other]

Distributed Estimation of Dynamic Parameters : Regret Analysis

Authors: Shahin Shahrampour, Alexander Rakhlin, Ali Jadbabaie

Abstract: This paper addresses the estimation of a time-varying parameter in a network. A group of agents sequentially receive noisy signals about the parameter (or moving target), which does not follow any particular dynamics. The parameter is not observable to an individual agent, but it is globally identifiable for the whole network. Viewing the problem with an online optimization lens, we aim to provide the finite-time or non-asymptotic analysis of the problem. To this end, we use a notion of dynamic regret which suits the online, non-stationary nature of the problem. In our setting, dynamic regret can be recognized as a finite-time counterpart of stability in the mean-square sense. We develop a distributed, online algorithm for tracking the moving target. Defining the path-length as the consecutive differences between target locations, we express an upper bound on regret in terms of the path-length of the target and network errors. We further show the consistency of the result with static setting and noiseless observations.

Submitted 1 March, 2016; originally announced March 2016.

Comments: 6 pages, To appear in American Control Conference 2016

arXiv:1503.03517 [pdf, ps, other]

Switching to Learn

Authors: Shahin Shahrampour, Mohammad Amin Rahimian, Ali Jadbabaie

Abstract: A network of agents attempt to learn some unknown state of the world drawn by nature from a finite set. Agents observe private signals conditioned on the true state, and form beliefs about the unknown state accordingly. Each agent may face an identification problem in the sense that she cannot distinguish the truth in isolation. However, by communicating with each other, agents are able to benefit from side observations to learn the truth collectively. Unlike many distributed algorithms which rely on all-time communication protocols, we propose an efficient method by switching between Bayesian and non-Bayesian regimes. In this model, agents exchange information only when their private signals are not informative enough; thence, by switching between the two regimes, agents efficiently learn the truth using only a few rounds of communications. The proposed algorithm preserves learnability while incurring a lower communication cost. We also verify our theoretical findings by simulation examples.

Submitted 11 March, 2015; originally announced March 2015.

Comments: 6 pages, To appear in American Control Conference 2015

arXiv:1501.06225 [pdf, ps, other]

Online Optimization : Competing with Dynamic Comparators

Authors: Ali Jadbabaie, Alexander Rakhlin, Shahin Shahrampour, Karthik Sridharan

Abstract: Recent literature on online learning has focused on developing adaptive algorithms that take advantage of a regularity of the sequence of observations, yet retain worst-case performance guarantees. A complementary direction is to develop prediction methods that perform well against complex benchmarks. In this paper, we address these two directions together. We present a fully adaptive method that competes with dynamic benchmarks in which regret guarantee scales with regularity of the sequence of cost functions and comparators. Notably, the regret bound adapts to the smaller complexity measure in the problem environment. Finally, we apply our results to drifting zero-sum, two-player games where both players achieve no regret guarantees against best sequences of actions in hindsight.

Submitted 25 January, 2015; originally announced January 2015.

Comments: 23 pages, To appear in International Conference on Artificial Intelligence and Statistics (AISTATS) 2015

arXiv:1409.8606 [pdf, other]

Distributed Detection : Finite-time Analysis and Impact of Network Topology

Authors: Shahin Shahrampour, Alexander Rakhlin, Ali Jadbabaie

Abstract: This paper addresses the problem of distributed detection in multi-agent networks. Agents receive private signals about an unknown state of the world. The underlying state is globally identifiable, yet informative signals may be dispersed throughout the network. Using an optimization-based framework, we develop an iterative local strategy for updating individual beliefs. In contrast to the existing literature which focuses on asymptotic learning, we provide a finite-time analysis. Furthermore, we introduce a Kullback-Leibler cost to compare the efficiency of the algorithm to its centralized counterpart. Our bounds on the cost are expressed in terms of network size, spectral gap, centrality of each agent and relative entropy of agents' signal structures. A key observation is that distributing more informative signals to central agents results in a faster learning rate. Furthermore, optimizing the weights, we can speed up learning by improving the spectral gap. We also quantify the effect of link failures on learning speed in symmetric networks. We finally provide numerical simulations which verify our theoretical results.

Submitted 30 September, 2014; originally announced September 2014.

Comments: 29 pages, 5 figures

arXiv:1310.0432 [pdf, ps, other]

Online Learning of Dynamic Parameters in Social Networks

Authors: Shahin Shahrampour, Alexander Rakhlin, Ali Jadbabaie

Abstract: This paper addresses the problem of online learning in a dynamic setting. We consider a social network in which each individual observes a private signal about the underlying state of the world and communicates with her neighbors at each time period. Unlike many existing approaches, the underlying state is dynamic, and evolves according to a geometric random walk. We view the scenario as an optimization problem where agents aim to learn the true state while suffering the smallest possible loss. Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state. We establish a tight bound on the rate of change of the underlying state, under which individuals can track the parameter with a bounded variance. Then, we characterize explicit expressions for the steady state mean-square deviation (MSD) of the estimates from the truth, per individual. We observe that only one of the estimators recovers the optimal MSD, which underscores the impact of the objective function decomposition on the learning quality. Finally, we provide an upper bound on the regret of the proposed methods, measured as an average of errors in estimating the parameter in a finite time.

Submitted 1 October, 2013; originally announced October 2013.

Comments: 12 pages, To appear in Neural Information Processing Systems (NIPS) 2013

arXiv:1309.2350 [pdf, ps, other]

Exponentially Fast Parameter Estimation in Networks Using Distributed Dual Averaging

Authors: Shahin Shahrampour, Ali Jadbabaie

Abstract: In this paper we present an optimization-based view of distributed parameter estimation and observational social learning in networks. Agents receive a sequence of random, independent and identically distributed (i.i.d.) signals, each of which individually may not be informative about the underlying true state, but the signals together are globally informative enough to make the true state identifiable. Using an optimization-based characterization of Bayesian learning as proximal stochastic gradient descent (with Kullback-Leibler divergence from a prior as a proximal function), we show how to efficiently use a distributed, online variant of Nesterov's dual averaging method to solve the estimation with purely local information. When the true state is globally identifiable, and the network is connected, we prove that agents eventually learn the true parameter using a randomized gossip scheme. We demonstrate that with high probability the convergence is exponentially fast with a rate dependent on the KL divergence of observations under the true state from observations under the second likeliest state. Furthermore, our work also highlights the possibility of learning under continuous adaptation of network which is a consequence of employing constant, unit stepsize for the algorithm.

Submitted 9 September, 2013; originally announced September 2013.

Comments: 6 pages, To appear in Conference on Decision and Control 2013

arXiv:1303.3250 [pdf, ps, other]

Reconstruction of Directed Networks from Consensus Dynamics

Authors: Shahin Shahrampour, Victor M. Preciado

Abstract: This paper addresses the problem of identifying the topology of an unknown, weighted, directed network running a consensus dynamics. We propose a methodology to reconstruct the network topology from the dynamic response when the system is stimulated by a wide-sense stationary noise of unknown power spectral density. The method is based on a node-knockout, or grounding, procedure wherein the grounded node broadcasts zero without being eliminated from the network. In this direction, we measure the empirical cross-power spectral densities of the outputs between every pair of nodes for both grounded and ungrounded consensus to reconstruct the unknown topology of the network. We also establish that in the special cases of undirected or purely unidirectional networks, the reconstruction does not need grounding. Finally, we extend our results to the case of a directed network assuming a general dynamics, and prove that the developed method can detect edges and their direction.

Submitted 15 March, 2013; v1 submitted 13 March, 2013; originally announced March 2013.

Comments: 6 pages

Journal ref: S. Shahrampour and V.M. Preciado, "Reconstruction of Directed Networks from Consensus Dynamics," in Proc. American Control Conference, 2013

Showing 1–49 of 49 results for author: Shahrampour, S