RESIST: Resilient Decentralized Learning Using Consensus Gradient Descent

Authors: Cheng Fang, Rishabh Dixit, Waheed U. Bajwa, Mert Gurbuzbalaban

Abstract: Empirical risk minimization (ERM) is a cornerstone of modern machine learning (ML), supported by advances in optimization theory that ensure efficient solutions with provable algorithmic convergence rates, which measure the speed at which optimization algorithms approach a solution, and statistical learning rates, which characterize how well the solution generalizes to unseen data. Privacy, memory, computational, and communications constraints increasingly necessitate data collection, processing, and storage across network-connected devices. In many applications, these networks operate in decentralized settings where a central server cannot be assumed, requiring decentralized ML algorithms that are both efficient and resilient. Decentralized learning, however, faces significant challenges, including an increased attack surface for adversarial interference during decentralized learning processes. This paper focuses on the man-in-the-middle (MITM) attack, which can cause models to deviate significantly from their intended ERM solutions. To address this challenge, we propose RESIST (Resilient dEcentralized learning using conSensus gradIent deScenT), an optimization algorithm designed to be robust against adversarially compromised communication links. RESIST achieves algorithmic and statistical convergence for strongly convex, Polyak-Lojasiewicz, and nonconvex ERM problems. Experimental results demonstrate the robustness and scalability of RESIST for real-world decentralized learning in adversarial environments.

Submitted 11 February, 2025; originally announced February 2025.

Comments: preprint of a journal paper; 100 pages and 17 figures

arXiv:2308.02922 [pdf, other]

Structured Low-Rank Tensors for Generalized Linear Models

Authors: Batoul Taki, Anand D. Sarwate, Waheed U. Bajwa

Abstract: Recent works have shown that imposing tensor structures on the coefficient tensor in regression problems can lead to more reliable parameter estimation and lower sample complexity compared to vector-based methods. This work investigates a new low-rank tensor model, called Low Separation Rank (LSR), in Generalized Linear Model (GLM) problems. The LSR model -- which generalizes the well-known Tucker and CANDECOMP/PARAFAC (CP) models, and is a special case of the Block Tensor Decomposition (BTD) model -- is imposed onto the coefficient tensor in the GLM model. This work proposes a block coordinate descent algorithm for parameter estimation in LSR-structured tensor GLMs. Most importantly, it derives a minimax lower bound on the error threshold on estimating the coefficient tensor in LSR tensor GLM problems. The minimax bound is proportional to the intrinsic degrees of freedom in the LSR tensor GLM problem, suggesting that its sample complexity may be significantly lower than that of vectorized GLMs. This result can also be specialised to lower bound the estimation error in CP and Tucker-structured GLMs. The derived bounds are comparable to tight bounds in the literature for Tucker linear regression, and the tightness of the minimax lower bound is further assessed numerically. Finally, numerical experiments on synthetic datasets demonstrate the efficacy of the proposed LSR tensor model for three regression types (linear, logistic and Poisson). Experiments on a collection of medical imaging datasets demonstrate the usefulness of the LSR model over other tensor models (Tucker and CP) on real, imbalanced data with limited available samples.

Submitted 5 August, 2023; originally announced August 2023.

Comments: 43 pages; published in Transactions on Machine Learning Research (08/2023)

Journal ref: Transactions on Machine Learning Research, Aug. 2023 (https://openreview.net/forum?id=qUxBs3Ln41)

arXiv:2105.14673 [pdf, ps, other]

doi 10.1109/IEEECONF53345.2021.9723149

A Minimax Lower Bound for Low-Rank Matrix-Variate Logistic Regression

Authors: Batoul Taki, Mohsen Ghassemi, Anand D. Sarwate, Waheed U. Bajwa

Abstract: This paper considers the problem of matrix-variate logistic regression. It derives the fundamental error threshold on estimating low-rank coefficient matrices in the logistic regression problem by obtaining a lower bound on the minimax risk. The bound depends explicitly on the dimension and distribution of the covariates, the rank and energy of the coefficient matrix, and the number of samples. The resulting bound is proportional to the intrinsic degrees of freedom in the problem, which suggests the sample complexity of the low-rank matrix logistic regression problem can be lower than that for vectorized logistic regression. The proof techniques utilized in this work also set the stage for development of minimax lower bounds for tensor-variate logistic regression problems.

Submitted 28 January, 2022; v1 submitted 30 May, 2021; originally announced May 2021.

Comments: 8 pages; published in Proc. 55th Asilomar Conf. Signals, Systems, and Computers, Pacific Grove, CA, Oct. 31-Nov. 3, 2021

arXiv:2101.01300 [pdf, other]

doi 10.1016/j.sigpro.2021.108408

A Linearly Convergent Algorithm for Distributed Principal Component Analysis

Authors: Arpita Gang, Waheed U. Bajwa

Abstract: Principal Component Analysis (PCA) is the workhorse tool for dimensionality reduction in this era of big data. While often overlooked, the purpose of PCA is not only to reduce data dimensionality, but also to yield features that are uncorrelated. Furthermore, the ever-increasing volume of data in the modern world often requires storage of data samples across multiple machines, which precludes the use of centralized PCA algorithms. This paper focuses on the dual objective of PCA, namely, dimensionality reduction and decorrelation of features, but in a distributed setting. This requires estimating the eigenvectors of the data covariance matrix, as opposed to only estimating the subspace spanned by the eigenvectors, when data is distributed across a network of machines. Although a few distributed solutions to the PCA problem have been proposed recently, convergence guarantees and/or communications overhead of these solutions remain a concern. With an eye towards communications efficiency, this paper introduces a feedforward neural network-based one time-scale distributed PCA algorithm termed Distributed Sanger's Algorithm (DSA) that estimates the eigenvectors of the data covariance matrix when data is distributed across an undirected and arbitrarily connected network of machines. Furthermore, the proposed algorithm is shown to converge linearly to a neighborhood of the true solution. Numerical results are also provided to demonstrate the efficacy of the proposed solution.

Submitted 28 November, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

Comments: 34 pages; final version of journal paper accepted for publication in a special issue of EURASIP J. Signal Processing

arXiv:2005.08854 [pdf, other]

doi 10.1109/JPROC.2020.3021381

Scaling-up Distributed Processing of Data Streams for Machine Learning

Authors: Matthew Nokleby, Haroon Raja, Waheed U. Bajwa

Abstract: Emerging applications of machine learning in numerous areas involve continuous gathering of and learning from streams of data. Real-time incorporation of streaming data into the learned models is essential for improved inference in these applications. Further, these applications often involve data that are either inherently gathered at geographically distributed entities or that are intentionally distributed across multiple machines for memory, computational, and/or privacy reasons. Training of models in this distributed, streaming setting requires solving stochastic optimization problems in a collaborative manner over communication links between the physical entities. When the streaming data rate is high compared to the processing capabilities of compute nodes and/or the rate of the communications links, this poses a challenging question: how can one best leverage the incoming data for distributed training under constraints on computing capabilities and/or communications rate? A large body of research has emerged in recent decades to tackle this and related problems. This paper reviews recently developed methods that focus on large-scale distributed stochastic optimization in the compute- and bandwidth-limited regime, with an emphasis on convergence analysis that explicitly accounts for the mismatch between computation, communication and streaming rates. In particular, it focuses on methods that solve: (i) distributed stochastic convex problems, and (ii) distributed principal component analysis, which is a nonconvex problem with geometric structure that permits global convergence. For such methods, the paper discusses recent advances in terms of distributed algorithmic designs when faced with high-rate streaming data. Further, it reviews guarantees underlying these methods, which show there exist regimes in which systems can learn from distributed, streaming data at order-optimal rates.

Submitted 31 August, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

Comments: 45 pages, 9 figures; preprint of a journal paper published in Proceedings of the IEEE (Special Issue on Optimization for Data-driven Learning and Control)

Journal ref: Proc. of the IEEE, vol. 108, no. 11, pp. 1984-2012, Nov. 2020

arXiv:2001.01017 [pdf, ps, other]

Distributed Stochastic Algorithms for High-rate Streaming Principal Component Analysis

Authors: Haroon Raja, Waheed U. Bajwa

Abstract: This paper considers the problem of estimating the principal eigenvector of a covariance matrix from independent and identically distributed data samples in streaming settings. The streaming rate of data in many contemporary applications can be high enough that a single processor cannot finish an iteration of existing methods for eigenvector estimation before a new sample arrives. This paper formulates and analyzes a distributed variant of the classical Krasulina's method (D-Krasulina) that can keep up with the high streaming rate of data by distributing the computational load across multiple processing nodes. The analysis shows that---under appropriate conditions---D-Krasulina converges to the principal eigenvector in an order-wise optimal manner; i.e., after receiving $M$ samples across all nodes, its estimation error can be $O(1/M)$. In order to reduce the network communication overhead, the paper also develops and analyzes a mini-batch extension of D-Krasulina, which is termed DM-Krasulina. The analysis of DM-Krasulina shows that it can also achieve order-optimal estimation error rates under appropriate conditions, even when some samples have to be discarded within the network due to communication latency. Finally, experiments are performed over synthetic and real-world data to validate the convergence behaviors of D-Krasulina and DM-Krasulina in high-rate streaming settings.

Submitted 3 January, 2020; originally announced January 2020.

Comments: 37 pages, 11 figures; preprint of a journal submission
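
Illustrative sketch for the entry above (not taken from the paper): the abstract describes spreading the per-sample work of Krasulina's method across several nodes. One plausible reading is that each node computes the classical Krasulina direction on its own sample and the directions are averaged before a shared update. The function name `d_krasulina_sketch`, the step-size schedule `c / (t + t0)`, and the assumption of exactly one sample per node per round are all hypothetical choices made here for illustration.

```python
import numpy as np

def d_krasulina_sketch(samples, num_nodes, c=1.0, t0=20, seed=0):
    """Illustrative distributed Krasulina-style iteration (assumed form).

    samples: array of shape (rounds * num_nodes, d); in each round every
    node is assumed to receive exactly one fresh sample.
    """
    rng = np.random.default_rng(seed)
    d = samples.shape[1]
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    rounds = samples.shape[0] // num_nodes
    for t in range(rounds):
        # Rows of `batch` play the role of the samples held by each node.
        batch = samples[t * num_nodes:(t + 1) * num_nodes]
        proj = batch @ w  # x_i^T w for every node i
        # Classical Krasulina direction per node:
        #   x x^T w - (w^T x x^T w / ||w||^2) w
        local = batch * proj[:, None] - np.outer(proj**2 / (w @ w), w)
        # Averaging stands in for the network-wide aggregation step.
        w = w + (c / (t + t0)) * local.mean(axis=0)
    return w / np.linalg.norm(w)

# Synthetic check: covariance diag(4, 1, 1, 1, 1), so the principal
# eigenvector is the first coordinate axis.
rng = np.random.default_rng(1)
scales = np.sqrt(np.array([4.0, 1.0, 1.0, 1.0, 1.0]))
samples = rng.standard_normal((8000, 5)) * scales
w_hat = d_krasulina_sketch(samples, num_nodes=4)
print(abs(w_hat[0]))  # alignment with the true principal eigenvector
```

The mini-batch variant mentioned in the abstract would correspond to each node averaging over several samples per round; that extension is not shown here.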

arXiv:1911.03725 [pdf, other]

doi 10.1137/19M1299335

Tensor Regression Using Low-rank and Sparse Tucker Decompositions

Authors: Talal Ahmed, Haroon Raja, Waheed U. Bajwa

Abstract: This paper studies a tensor-structured linear regression model with a scalar response variable and tensor-structured predictors, such that the regression parameters form a tensor of order $d$ (i.e., a $d$-fold multiway array) in $\mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$. It focuses on the task of estimating the regression tensor from $m$ realizations of the response variable and the predictors where $m\ll n = \prod \nolimits_{i} n_i$. Despite the seeming ill-posedness of this problem, it can still be solved if the parameter tensor belongs to the space of sparse, low Tucker-rank tensors. Accordingly, the estimation procedure is posed as a non-convex optimization program over the space of sparse, low Tucker-rank tensors, and a tensor variant of projected gradient descent is proposed to solve the resulting non-convex problem. In addition, mathematical guarantees are provided that establish the proposed method linearly converges to an appropriate solution under a certain set of conditions. Further, an upper bound on sample complexity of tensor parameter estimation for the model under consideration is characterized for the special case when the individual (scalar) predictors independently draw values from a sub-Gaussian distribution. The sample complexity bound is shown to have a polylogarithmic dependence on $\bar{n} = \max \big\{n_i: i\in \{1,2,\ldots,d \} \big\}$ and, orderwise, it matches the bound one can obtain from a heuristic parameter counting argument. Finally, numerical experiments demonstrate the efficacy of the proposed tensor model and estimation method on a synthetic dataset and a collection of neuroimaging datasets pertaining to attention deficit hyperactivity disorder. Specifically, the proposed method exhibits better sample complexities on both synthetic and real datasets, demonstrating the usefulness of the model and the method in settings where $n \gg m$.

Submitted 20 July, 2020; v1 submitted 9 November, 2019; originally announced November 2019.

Comments: 28 pages, 5 figures, 2 tables; preprint of a journal paper published in SIAM Journal on Mathematics of Data Science

MSC Class: 41A52; 41A63; 62F10; 62J05

Journal ref: SIAM J. Math. Data Science, vol. 2, no. 4, pp. 944-966, 2020

arXiv:1908.08649 [pdf, other]

doi 10.1109/MSP.2020.2973345

Adversary-resilient Distributed and Decentralized Statistical Inference and Machine Learning: An Overview of Recent Advances Under the Byzantine Threat Model

Authors: Zhixiong Yang, Arpita Gang, Waheed U. Bajwa

Abstract: While the last few decades have witnessed a huge body of work devoted to inference and learning in distributed and decentralized setups, much of this work assumes a non-adversarial setting in which individual nodes---apart from occasional statistical failures---operate as intended within the algorithmic framework. In recent years, however, cybersecurity threats from malicious non-state actors and rogue entities have forced practitioners and researchers to rethink the robustness of distributed and decentralized algorithms against adversarial attacks. As a result, we now have a plethora of algorithmic approaches that guarantee robustness of distributed and/or decentralized inference and learning under different adversarial threat models. Driven in part by the world's growing appetite for data-driven decision making, however, securing of distributed/decentralized frameworks for inference and learning against adversarial threats remains a rapidly evolving research area. In this article, we provide an overview of some of the most recent developments in this area under the threat model of Byzantine attacks.

Submitted 1 June, 2020; v1 submitted 22 August, 2019; originally announced August 2019.

Comments: 24 pages, 6 figures, 2 tables; Published in IEEE Signal Processing Magazine, May 2020 (Special Issue on "Machine Learning From Distributed, Streaming Data")

Journal ref: IEEE Signal Processing Mag., vol. 37, no. 3, pp. 146-159, May 2020

arXiv:1908.08098 [pdf, other]

BRIDGE: Byzantine-resilient Decentralized Gradient Descent

Authors: Cheng Fang, Zhixiong Yang, Waheed U. Bajwa

Abstract: Machine learning has begun to play a central role in many applications. A multitude of these applications typically also involve datasets that are distributed across multiple computing devices/machines due to either design constraints (e.g., multiagent systems) or computational/privacy reasons (e.g., learning on smartphone data). Such applications often require the learning tasks to be carried out in a decentralized fashion, in which there is no central server that is directly connected to all nodes. In real-world decentralized settings, nodes are prone to undetected failures due to malfunctioning equipment, cyberattacks, etc., which are likely to crash non-robust learning algorithms. The focus of this paper is on robustification of decentralized learning in the presence of nodes that have undergone Byzantine failures. The Byzantine failure model allows faulty nodes to arbitrarily deviate from their intended behaviors, thereby ensuring designs of the most robust of algorithms. But the study of Byzantine resilience within decentralized learning, in contrast to distributed learning, is still in its infancy. In particular, existing Byzantine-resilient decentralized learning methods either do not scale well to large-scale machine learning models, or they lack statistical convergence guarantees that help characterize their generalization errors. In this paper, a scalable, Byzantine-resilient decentralized machine learning framework termed Byzantine-resilient decentralized gradient descent (BRIDGE) is introduced. Algorithmic and statistical convergence guarantees for one variant of BRIDGE are also provided in the paper for both strongly convex problems and a class of nonconvex problems. In addition, large-scale decentralized learning experiments are used to establish that the BRIDGE framework is scalable and it delivers competitive results for Byzantine-resilient convex and nonconvex learning.

Submitted 14 June, 2022; v1 submitted 21 August, 2019; originally announced August 2019.

Comments: 20 pages, 10 figures, 2 tables; some expanded discussion as well as additional numerical experiments using the CIFAR-10 dataset

arXiv:1908.00195 [pdf, other]

doi 10.1109/TCCN.2020.2990657

Learning-Aided Physical Layer Attacks Against Multicarrier Communications in IoT

Authors: Alireza Nooraiepour, Waheed U. Bajwa, Narayan B. Mandayam

Abstract: Internet-of-Things (IoT) devices that are limited in power and processing are susceptible to physical layer (PHY) spoofing (signal exploitation) attacks owing to their inability to implement a full-blown protocol stack for security. The overwhelming adoption of multicarrier techniques such as orthogonal frequency division multiplexing (OFDM) for the PHY layer makes IoT devices further vulnerable to PHY spoofing attacks. These attacks, which aim at injecting bogus/spurious data into the receiver, involve inferring transmission parameters and finding PHY characteristics of the transmitted signals so as to spoof the received signal. Non-contiguous (NC) OFDM systems have been argued to have low probability of exploitation (LPE) characteristics against classic attacks based on cyclostationary analysis, and the corresponding PHY has been deemed to be secure. However, with the advent of machine learning (ML) algorithms, adversaries can devise data-driven attacks to compromise such systems. It is in this vein that the PHY spoofing performance of adversaries equipped with supervised and unsupervised ML tools is investigated in this paper. The supervised ML approach is based on deep neural networks (DNNs) while the unsupervised one employs variational autoencoders (VAEs). In particular, VAEs are shown to be capable of learning representations from NC-OFDM signals related to their PHY characteristics such as frequency pattern and modulation scheme, which are useful for PHY spoofing. In addition, a new metric based on the disentanglement principle is proposed to measure the quality of such learned representations. Simulation results demonstrate that the performance of the spoofing adversaries highly depends on the subcarriers' allocation patterns. Particularly, it is shown that utilizing a random subcarrier occupancy pattern secures NC-OFDM systems against ML-based attacks.

Submitted 4 July, 2020; v1 submitted 31 July, 2019; originally announced August 2019.

Comments: 15 pages; 20 figures; 3 tables; preprint of a paper accepted for publication in IEEE Trans. Cognitive Commun. Netw.

Journal ref: IEEE Trans. Cognitive Commun. Netw., vol. 7, no. 1, pp. 239-254, Mar. 2021

arXiv:1903.09284 [pdf, other]

doi 10.1109/TSP.2019.2952046

Learning Mixtures of Separable Dictionaries for Tensor Data: Analysis and Algorithms

Authors: Mohsen Ghassemi, Zahra Shakeri, Anand D. Sarwate, Waheed U. Bajwa

Abstract: This work addresses the problem of learning sparse representations of tensor data using structured dictionary learning. It proposes learning a mixture of separable dictionaries to better capture the structure of tensor data by generalizing the separable dictionary learning model. Two different approaches for learning mixture of separable dictionaries are explored and sufficient conditions for local identifiability of the underlying dictionary are derived in each case. Moreover, computational algorithms are developed to solve the problem of learning mixture of separable dictionaries in both batch and online settings. Numerical experiments are used to show the usefulness of the proposed model and the efficacy of the developed algorithms.

Submitted 13 June, 2020; v1 submitted 21 March, 2019; originally announced March 2019.

Comments: 18 pages, 4 figures, 3 tables; Published in IEEE Trans. Signal Processing

Journal ref: IEEE Trans. Signal Processing, vol. 68, pp. 33-48, 2020

arXiv:1712.03471 [pdf, other]

doi 10.1109/JSTSP.2018.2838092

Identifiability of Kronecker-structured Dictionaries for Tensor Data

Authors: Zahra Shakeri, Anand D. Sarwate, Waheed U. Bajwa

Abstract: This paper derives sufficient conditions for local recovery of coordinate dictionaries comprising a Kronecker-structured dictionary that is used for representing $K$th-order tensor data. Tensor observations are assumed to be generated from a Kronecker-structured dictionary multiplied by sparse coefficient tensors that follow the separable sparsity model. This work provides sufficient conditions on the underlying coordinate dictionaries, coefficient and noise distributions, and number of samples that guarantee recovery of the individual coordinate dictionaries up to a specified error, as a local minimum of the objective function, with high probability. In particular, the sample complexity to recover $K$ coordinate dictionaries with dimensions $m_k \times p_k$ up to estimation error $\varepsilon_k$ is shown to be $\max_{k \in [K]}\mathcal{O}(m_kp_k^3\varepsilon_k^{-2})$.

Submitted 25 May, 2018; v1 submitted 10 December, 2017; originally announced December 2017.

Comments: 16 pages, to appear in IEEE Journal of Selected Topics in Signal Processing

Journal ref: IEEE J. Sel. Topics Signal Processing, vol. 12, no. 5, pp. 1047-1062, Oct. 2018

arXiv:1711.04887 [pdf, other]

STARK: Structured Dictionary Learning Through Rank-one Tensor Recovery

Authors: Mohsen Ghassemi, Zahra Shakeri, Anand D. Sarwate, Waheed U. Bajwa

Abstract: In recent years, a class of dictionaries has been proposed for multidimensional (tensor) data representation that exploits the structure of tensor data by imposing a Kronecker structure on the dictionary underlying the data. In this work, a novel algorithm called "STARK" is provided to learn Kronecker structured dictionaries that can represent tensors of any order. By establishing that the Kronecker product of any number of matrices can be rearranged to form a rank-1 tensor, we show that Kronecker structure can be enforced on the dictionary by solving a rank-1 tensor recovery problem. Because rank-1 tensor recovery is a challenging nonconvex problem, we resort to solving a convex relaxation of this problem. Empirical experiments on synthetic and real data show promising results for our proposed algorithm.

Submitted 13 November, 2017; originally announced November 2017.
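
Illustrative sketch for the entry above (not taken from the paper): the abstract's key observation is that a Kronecker product of matrices can be rearranged into a rank-1 tensor. The NumPy snippet below checks one such rearrangement numerically for three factors. The helper name `rearrange_kron` and the row-major (C-order) vectorization are assumptions made here; the paper may use a different but equivalent index ordering.

```python
import numpy as np

def rearrange_kron(K, shapes):
    """Rearrange K = A_1 kron ... kron A_n into an n-way tensor.

    shapes: list of (rows, cols) for each factor, in the order they
    appear in the Kronecker product. With row-major vectorization, the
    result equals the outer product vec(A_1) o ... o vec(A_n).
    """
    n = len(shapes)
    rows = [m for m, _ in shapes]
    cols = [p for _, p in shapes]
    # Split the row index and the column index of K into per-factor indices.
    T = K.reshape(rows + cols)
    # Bring each factor's (row, col) index pair next to each other.
    order = [axis for k in range(n) for axis in (k, n + k)]
    T = T.transpose(order)
    # Merge each (row, col) pair into a single mode of size rows * cols.
    return T.reshape([m * p for m, p in shapes])

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((3, 5))

K = np.kron(np.kron(A, B), C)
T = rearrange_kron(K, [A.shape, B.shape, C.shape])

# Rank-1 tensor built directly from the vectorized factors.
rank_one = np.einsum("i,j,k->ijk", A.ravel(), B.ravel(), C.ravel())
print(np.allclose(T, rank_one))
```

Because the rearrangement is only a permutation of entries, a constraint that the rearranged dictionary be rank-1 is equivalent to requiring Kronecker structure, which is the link the abstract draws on.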

arXiv:1708.08155 [pdf, other]

doi 10.1109/TSIPN.2019.2928176

ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning

Authors: Zhixiong Yang, Waheed U. Bajwa

Abstract: Distributed machine learning algorithms enable learning of models from datasets that are distributed over a network without gathering the data at a centralized location. While efficient distributed algorithms have been developed under the assumption of faultless networks, failures that can render these algorithms nonfunctional occur frequently in the real world. This paper focuses on the problem of Byzantine failures, which are the hardest to safeguard against in distributed algorithms. While Byzantine fault tolerance has a rich history, existing work does not translate into efficient and practical algorithms for high-dimensional learning in fully distributed (also known as decentralized) settings. In this paper, an algorithm termed Byzantine-resilient distributed coordinate descent (ByRDiE) is developed and analyzed that enables distributed learning in the presence of Byzantine failures. Theoretical analysis (convex settings) and numerical experiments (convex and nonconvex settings) highlight its usefulness for high-dimensional distributed learning in the presence of Byzantine failures.

Submitted 5 July, 2019; v1 submitted 27 August, 2017; originally announced August 2017.

Comments: Preprint of a paper accepted into IEEE Transactions on Signal and Information Processing Over Networks; 16 pages, 5 figures, and 1 table

Journal ref: IEEE Trans. Signal Inform. Proc. over Netw., vol. 5, no. 4, pp. 611-627, Dec. 2019

arXiv:1708.06077 [pdf, other]

doi 10.1016/j.sigpro.2019.01.018

ExSIS: Extended Sure Independence Screening for Ultrahigh-dimensional Linear Models

Authors: Talal Ahmed, Waheed U. Bajwa

Abstract: Statistical inference can be computationally prohibitive in ultrahigh-dimensional linear models. Correlation-based variable screening, in which one leverages marginal correlations for removal of irrelevant variables from the model prior to statistical inference, can be used to overcome this challenge. Prior works on correlation-based variable screening either impose statistical priors on the linear model or assume specific post-screening inference methods. This paper first extends the analysis of correlation-based variable screening to arbitrary linear models and post-screening inference techniques. In particular, (i) it shows that a condition---termed the screening condition---is sufficient for successful correlation-based screening of linear models, and (ii) it provides insights into the dependence of marginal correlation-based screening on different problem parameters. Numerical experiments confirm that these insights are not mere artifacts of analysis; rather, they are reflective of the challenges associated with marginal correlation-based variable screening. Second, the paper explicitly derives the screening condition for arbitrary (random or deterministic) linear models and, in the process, it establishes that---under appropriate conditions---it is possible to reduce the dimension of an ultrahigh-dimensional, arbitrary linear model to almost the sample size even when the number of active variables scales almost linearly with the sample size. Third, it specializes the screening condition to sub-Gaussian linear models and contrasts the final results to those existing in the literature. This specialization formally validates the claim that the main result of this paper generalizes existing ones on correlation-based screening.
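
As a rough illustration of the correlation-based screening idea described in this abstract, the sketch below ranks the columns of a design matrix by the magnitude of their marginal correlation with the response and keeps the top few. It is a generic sketch, not the ExSIS procedure from the paper: the function name `screen_by_marginal_correlation`, the unit-norm column scaling, and the choice of how many variables to keep are all assumptions made here for illustration.

```python
import numpy as np

def screen_by_marginal_correlation(X, y, keep):
    """Return the sorted indices of the `keep` columns of X whose
    marginal correlation with y is largest in magnitude."""
    # Scale columns to unit l2 norm so correlations are comparable
    # (an assumption of this sketch, not taken from the paper).
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero columns
    corr = np.abs((X / norms).T @ y)
    # Indices of the `keep` largest correlations, in no particular order.
    top = np.argpartition(corr, -keep)[-keep:]
    return np.sort(top)

# Hypothetical synthetic problem: many more variables than samples.
rng = np.random.default_rng(0)
n, p = 100, 2000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 5.0  # three active variables (illustrative values)
y = X @ beta + 0.1 * rng.standard_normal(n)

# Reduce the model from p variables to n variables before any inference step.
kept = screen_by_marginal_correlation(X, y, keep=n)
```

Any post-screening inference method would then be run on `X[:, kept]` only, which is the computational saving the abstract refers to.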

Submitted 4 July, 2020; v1 submitted 20 August, 2017; originally announced August 2017.

Comments: 30 pages; 3 figures and 1 table; preprint of a journal publication

Journal ref: EURASIP J. Signal Processing, vol. 159, pp. 33-48, Jun. 2019

arXiv:1704.07888 [pdf, other]

doi 10.1109/TSIPN.2018.2866320

Stochastic Optimization from Distributed, Streaming Data in Rate-limited Networks

Authors: Matthew Nokleby, Waheed U. Bajwa

Abstract: Motivated by machine learning applications in networks of sensors, internet-of-things (IoT) devices, and autonomous agents, we propose techniques for distributed stochastic convex learning from high-rate data streams. The setup involves a network of nodes---each one of which has a stream of data arriving at a constant rate---that solve a stochastic convex optimization problem by collaborating with each other over rate-limited communication links. To this end, we present and analyze two algorithms---termed distributed stochastic approximation mirror descent (D-SAMD) and accelerated distributed stochastic approximation mirror descent (AD-SAMD)---that are based on two stochastic variants of mirror descent and in which nodes collaborate via approximate averaging of the local, noisy subgradients using distributed consensus. Our main contributions are (i) bounds on the convergence rates of D-SAMD and AD-SAMD in terms of the number of nodes, network topology, and ratio of the data streaming and communication rates, and (ii) sufficient conditions for order-optimum convergence of these algorithms. In particular, we show that for sufficiently well-connected networks, distributed learning schemes can obtain order-optimum convergence even if the communications rate is small. Further, we find that the use of accelerated methods significantly enlarges the regime in which order-optimum convergence is achieved; this is in contrast to the centralized setting, where accelerated methods usually offer only a modest improvement. Finally, we demonstrate the effectiveness of the proposed algorithms using numerical experiments.
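
The "approximate averaging ... using distributed consensus" step mentioned in this abstract can be sketched in isolation. The code below is not D-SAMD or AD-SAMD; it only shows the generic consensus primitive, in which each node repeatedly replaces its local vector by a weighted average of its neighbors' vectors. The ring topology, the equal weights of 1/3, and the names `ring_weights` and `consensus_average` are assumptions chosen for illustration.

```python
import numpy as np

def ring_weights(m):
    """Doubly stochastic mixing matrix for a ring of m nodes
    (a hypothetical topology chosen for this sketch)."""
    W = np.zeros((m, m))
    for i in range(m):
        # Each node weights itself and its two ring neighbors equally.
        W[i, i] += 1.0 / 3.0
        W[i, (i - 1) % m] += 1.0 / 3.0
        W[i, (i + 1) % m] += 1.0 / 3.0
    return W

def consensus_average(local_values, W, rounds):
    """Run `rounds` consensus iterations x <- W x.
    Row i of the result is node i's estimate of the network-wide average."""
    x = np.array(local_values, dtype=float)
    for _ in range(rounds):
        x = W @ x
    return x

# Five nodes, each holding a 2-dimensional local (sub)gradient.
local_grads = np.arange(10, dtype=float).reshape(5, 2)
W = ring_weights(5)

few = consensus_average(local_grads, W, rounds=3)    # coarse agreement
many = consensus_average(local_grads, W, rounds=50)  # close to the exact average
```

With few rounds the nodes only roughly agree, which corresponds to the approximate averaging that a rate-limited network can afford; better-connected networks mix faster and need fewer rounds for the same accuracy.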

Submitted 6 August, 2018; v1 submitted 25 April, 2017; originally announced April 2017.

Comments: 16 pages, 6 figures; Accepted for publication in IEEE Transactions on Signal and Information Processing over Networks

Journal ref: Published in IEEE Trans. Signal Inform. Proc. over Netw., vol. 5, no. 1, pp. 152-167, Mar. 2019

arXiv:1612.07857 [pdf, other]

doi 10.7282/t3-t7fe-4a02

Human Action Attribute Learning From Video Data Using Low-Rank Representations

Authors: Tong Wu, Prudhvi Gurram, Raghuveer M. Rao, Waheed U. Bajwa

Abstract: Representation of human actions as a sequence of human body movements or action attributes enables the development of models for human activity recognition and summarization. We present an extension of the low-rank representation (LRR) model, termed the clustering-aware structure-constrained low-rank representation (CS-LRR) model, for unsupervised learning of human action attributes from video data. Our model is based on the union-of-subspaces (UoS) framework, and integrates spectral clustering into the LRR optimization problem for better subspace clustering results. We lay out an efficient linear alternating direction method to solve the CS-LRR optimization problem. We also introduce a hierarchical subspace clustering approach, termed hierarchical CS-LRR, to learn the attributes without the need for a priori specification of their number. By visualizing and labeling these action attributes, the hierarchical model can be used to semantically summarize long video sequences of human actions at multiple resolutions. A human action or activity can also be uniquely represented as a sequence of transitions from one action attribute to another, which can then be used for human action recognition. We demonstrate the effectiveness of the proposed model for semantic summarization and action recognition through comprehensive experiments on five real-world human action datasets.

Submitted 4 July, 2020; v1 submitted 22 December, 2016; originally announced December 2016.

Comments: 26 pages; 8 figures; 2 tables; Rutgers University Technical Report #2020-07-001

Report number: Rutgers University Technical Report #2020-07-001

arXiv:1605.05284 [pdf, other]

doi 10.1109/ISIT.2016.7541479

Minimax Lower Bounds for Kronecker-Structured Dictionary Learning

Authors: Zahra Shakeri, Waheed U. Bajwa, Anand D. Sarwate

Abstract: Dictionary learning is the problem of estimating the collection of atomic elements that provide a sparse representation of measured/collected signals or data. This paper finds fundamental limits on the sample complexity of estimating dictionaries for tensor data by proving a lower bound on the minimax risk. This lower bound depends on the dimensions of the tensor and parameters of the generative model. The focus of this paper is on second-order tensor data, with the underlying dictionaries constructed by taking the Kronecker product of two smaller dictionaries and the observed data generated by sparse linear combinations of dictionary atoms observed through white Gaussian noise. In this regard, the paper provides a general lower bound on the minimax risk and also adapts the proof techniques for equivalent results using sparse and Gaussian coefficient models. The reported results suggest that the sample complexity of dictionary learning for tensor data can be significantly lower than that for unstructured data.

Submitted 17 May, 2016; originally announced May 2016.

Comments: 5 pages, 1 figure. To appear in 2016 IEEE International Symposium on Information Theory

Journal ref: Proc. IEEE Intl. Symp. Information Theory, Barcelona, Spain, Jul. 10-15, 2016, pp. 1148-1152

arXiv:1412.7839 [pdf, other]

doi 10.1109/TSP.2015.2472372

Cloud K-SVD: A Collaborative Dictionary Learning Algorithm for Big, Distributed Data

Authors: Haroon Raja, Waheed U. Bajwa

Abstract: This paper studies the problem of data-adaptive representations for big, distributed data. It is assumed that a number of geographically-distributed, interconnected sites have massive local data and they are interested in collaboratively learning a low-dimensional geometric structure underlying these data. In contrast to previous works on subspace-based data representations, this paper focuses on the geometric structure of a union of subspaces (UoS). In this regard, it proposes a distributed algorithm---termed cloud K-SVD---for collaborative learning of a UoS structure underlying distributed data of interest. The goal of cloud K-SVD is to learn a common overcomplete dictionary at each individual site such that every sample in the distributed data can be represented through a small number of atoms of the learned dictionary. Cloud K-SVD accomplishes this goal without requiring exchange of individual samples between sites. This makes it suitable for applications where sharing of raw data is discouraged due to either privacy concerns or large volumes of data. This paper also provides an analysis of cloud K-SVD that gives insights into its properties as well as deviations of the dictionaries learned at individual sites from a centralized solution in terms of different measures of local/global data and topology of interconnections. Finally, the paper numerically illustrates the efficacy of cloud K-SVD on real and synthetic distributed data.

Submitted 17 August, 2015; v1 submitted 25 December, 2014; originally announced December 2014.

Comments: Accepted for Publication in IEEE Trans. Signal Processing (2015); 16 pages, 3 figures

Journal ref: IEEE Trans. Signal Processing, vol. 64, no. 1, pp. 173-188, Jan. 2016

arXiv:1412.6808 [pdf, other]

doi 10.1109/TSP.2015.2469637

Learning the nonlinear geometry of high-dimensional data: Models and algorithms

Authors: Tong Wu, Waheed U. Bajwa

Abstract: Modern information processing relies on the axiom that high-dimensional data lie near low-dimensional geometric structures. This paper revisits the problem of data-driven learning of these geometric structures and puts forth two new nonlinear geometric models for data describing "related" objects/phenomena. The first one of these models straddles the two extremes of the subspace model and the union-of-subspaces model, and is termed the metric-constrained union-of-subspaces (MC-UoS) model. The second one of these models---suited for data drawn from a mixture of nonlinear manifolds---generalizes the kernel subspace model, and is termed the metric-constrained kernel union-of-subspaces (MC-KUoS) model. The main contributions of this paper in this regard include the following. First, it motivates and formalizes the problems of MC-UoS and MC-KUoS learning. Second, it presents algorithms that efficiently learn an MC-UoS or an MC-KUoS underlying data of interest. Third, it extends these algorithms to the case when parts of the data are missing. Last, but not least, it reports the outcomes of a series of numerical experiments involving both synthetic and real data that demonstrate the superiority of the proposed geometric models and learning algorithms over existing approaches in the literature. These experiments also help clarify the connections between this work and the literature on (subspace and kernel k-means) clustering.

Submitted 9 August, 2015; v1 submitted 21 December, 2014; originally announced December 2014.

Comments: Extended version of the journal paper accepted for publication in IEEE Trans. Signal Processing (20 pages, 7 figures, 4 tables)

Journal ref: IEEE Trans. Signal Processing, vol. 63, no. 23, pp. 6229-6244, Dec. 2015

arXiv:1409.3954 [pdf, ps, other]

doi 10.1109/TAES.2015.140452

MIMO-MC Radar: A MIMO Radar Approach Based on Matrix Completion

Authors: Shunqiao Sun, Waheed U. Bajwa, Athina P. Petropulu

Abstract: In a typical MIMO radar scenario, transmit nodes transmit orthogonal waveforms, while each receive node performs matched filtering with the known set of transmit waveforms, and forwards the results to the fusion center. Based on the data it receives from multiple antennas, the fusion center formulates a matrix, which, in conjunction with standard array processing schemes, such as MUSIC, leads to target detection and parameter estimation. In MIMO radars with compressive sensing (MIMO-CS), the data matrix is formulated by each receive node forwarding a small number of compressively obtained samples. In this paper, it is shown that under certain conditions, in both sampling cases, the data matrix at the fusion center is low-rank, and thus can be recovered based on knowledge of a small subset of its entries via matrix completion (MC) techniques. Leveraging the low-rank property of that matrix, we propose a new MIMO radar approach, termed MIMO-MC radar, in which each receive node either performs matched filtering with a small number of randomly selected dictionary waveforms or obtains sub-Nyquist samples of the received signal at random sampling instants, and forwards the results to a fusion center. Based on the received samples, and with knowledge of the sampling scheme, the fusion center partially fills the data matrix and subsequently applies MC techniques to estimate the full matrix. MIMO-MC radars share the advantages of the recently proposed MIMO-CS radars, i.e., high resolution with reduced amounts of data, but unlike MIMO-CS radars do not require grid discretization. The MIMO-MC radar concept is illustrated through a linear uniform array configuration, and its target estimation performance is demonstrated via simulations.

Submitted 13 September, 2014; originally announced September 2014.

Comments: 29 pages, 13 figures, IEEE Trans. on Aerospace and Electronic Systems

Journal ref: IEEE Trans. Aerosp. Electron. Syst., vol. 51, no. 3, pp. 1839-1852, Jul. 2015

arXiv:1302.4118 [pdf, ps, other]

doi 10.1109/ICASSP.2013.6638439

Target Estimation in Colocated MIMO Radar via Matrix Completion

Authors: Shunqiao Sun, Athina P. Petropulu, Waheed U. Bajwa

Abstract: We consider a colocated MIMO radar scenario, in which the receive antennas forward their measurements to a fusion center. Based on the received data, the fusion center formulates a matrix which is then used for target parameter estimation. When the receive antennas sample the target returns at Nyquist rate, and assuming that there are more receive antennas than targets, the data matrix at the fusion center is low-rank. When each receive antenna sends to the fusion center only a small number of samples, along with the sample index, the receive data matrix has missing elements, corresponding to the samples that were not forwarded. Under certain conditions, matrix completion techniques can be applied to recover the full receive data matrix, which can then be used in conjunction with array processing techniques, e.g., MUSIC, to obtain target information. Numerical results indicate that good target recovery can be achieved with occupancy of the receive data matrix as low as 50%.

Submitted 25 March, 2013; v1 submitted 17 February, 2013; originally announced February 2013.

Comments: 5 pages, ICASSP 2013

Journal ref: Proc. IEEE Intl. Conf. Acoustics, Speech, and Signal Processing, Vancouver, Canada, May 26-31, 2013, pp. 4144-4148

arXiv:1210.2440 [pdf, ps, other]

doi 10.1109/Allerton.2012.6483259

Group Model Selection Using Marginal Correlations: The Good, the Bad and the Ugly

Authors: Waheed U. Bajwa, Dustin G. Mixon

Abstract: Group model selection is the problem of determining a small subset of groups of predictors (e.g., the expression data of genes) that are responsible for the majority of the variation in a response variable (e.g., the malignancy of a tumor). This paper focuses on group model selection in high-dimensional linear models, in which the number of predictors far exceeds the number of samples of the response variable. Existing works on high-dimensional group model selection either require the number of samples of the response variable to be significantly larger than the total number of predictors contributing to the response or impose restrictive statistical priors on the predictors and/or nonzero regression coefficients. This paper provides a comprehensive understanding of a low-complexity approach to group model selection that avoids some of these limitations. The proposed approach, termed Group Thresholding (GroTh), is based on thresholding of marginal correlations of groups of predictors with the response variable and is reminiscent of existing thresholding-based approaches in the literature. The most important contribution of the paper in this regard is relating the performance of GroTh to a polynomial-time verifiable property of the predictors for the general case of arbitrary (random or deterministic) predictors and arbitrary nonzero regression coefficients.
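
A minimal sketch of thresholding marginal correlations at the group level, in the spirit of the description in this abstract, is given below. It should not be read as the GroTh algorithm itself: the scoring rule (Euclidean norm of each group's correlations with the response), the top-k selection in place of an explicit threshold, and the name `group_threshold` are assumptions made for illustration.

```python
import numpy as np

def group_threshold(X, y, groups, num_groups):
    """Score each group of columns by the l2 norm of its marginal
    correlations with y and return the indices of the `num_groups`
    highest-scoring groups, in ascending order."""
    scores = np.array([np.linalg.norm(X[:, g].T @ y) for g in groups])
    # Sort group indices by decreasing score and keep the leading ones.
    order = np.argsort(scores)[::-1][:num_groups]
    return sorted(order.tolist())

# Toy example: six predictors arranged in three groups of two.
X_toy = np.eye(6)
groups_toy = [[0, 1], [2, 3], [4, 5]]
y_toy = np.array([0.0, 0.0, 3.0, 4.0, 1.0, 0.0])

# Group scores are 0, 5 and 1, so the middle group is selected first.
selected = group_threshold(X_toy, y_toy, groups_toy, num_groups=1)
```

The cost is one matrix-vector product per group, which is what makes this family of methods low-complexity compared with optimization-based group selection.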

Submitted 8 October, 2012; originally announced October 2012.

Comments: Accepted for publication in Proc. 50th Annu. Allerton Conf. Communication, Control, and Computing, Monticello, IL, Oct. 1-5, 2012; 8 pages and 4 figures

Journal ref: Proc. 50th Annu. Allerton Conf. Communication, Control, and Computing, Monticello, IL, Oct. 1-5, 2012, pp. 494-501

arXiv:1209.3990 [pdf, other]

doi 10.1137/120891927

Level set estimation from projection measurements: Performance guarantees and fast computation

Authors: Kalyani Krishnamurthy, Waheed U. Bajwa, Rebecca Willett

Abstract: Estimation of the level set of a function (i.e., regions where the function exceeds some value) is an important problem with applications in digital elevation mapping, medical imaging, astronomy, etc. In many applications, the function of interest is not observed directly. Rather, it is acquired through (linear) projection measurements, such as tomographic projections, interferometric measurements, coded-aperture measurements, and random projections associated with compressed sensing. This paper describes a new methodology for rapid and accurate estimation of the level set from such projection measurements. The key defining characteristic of the proposed method, called the projective level set estimator, is its ability to estimate the level set from projection measurements without an intermediate reconstruction step. This leads to significantly faster computation relative to heuristic "plug-in" methods that first estimate the function, typically with an iterative algorithm, and then threshold the result. The paper also includes a rigorous theoretical analysis of the proposed method, which utilizes recent results from the non-asymptotic theory of random matrices and results from the literature on concentration of measure, and characterizes the estimator's performance in terms of the geometry of the measurement operator and the 1-norm of the discretized function.

Submitted 2 May, 2013; v1 submitted 18 September, 2012; originally announced September 2012.

Comments: 23 pages, 20 figures

MSC Class: 62; 68

Journal ref: SIAM J. Imaging Sciences, vol. 6, no. 4, pp. 2047-2074, Oct. 2013

arXiv:1104.4135 [pdf, ps, other]

doi 10.1093/biomet/ast028

Posterior consistency in linear models under shrinkage priors

Authors: Artin Armagan, David B. Dunson, Jaeyong Lee, Waheed U. Bajwa, Nate Strawn

Abstract: We investigate the asymptotic behavior of posterior distributions of regression coefficients in high-dimensional linear models as the number of dimensions grows with the number of observations. We show that the posterior distribution concentrates in neighborhoods of the true parameter under simple sufficient conditions. These conditions hold under popular shrinkage priors given some sparsity assumptions.

Submitted 19 May, 2013; v1 submitted 20 April, 2011; originally announced April 2011.

Comments: To appear in Biometrika

Journal ref: Biometrika, vol. 100, no. 4, pp. 1011-1018, Dec. 2013

arXiv:1006.0719 [pdf, ps, other]

doi 10.1109/JCN.2010.6388466

Why Gabor Frames? Two Fundamental Measures of Coherence and Their Role in Model Selection

Authors: Waheed U. Bajwa, Robert Calderbank, Sina Jafarpour

Abstract: This paper studies non-asymptotic model selection for the general case of arbitrary design matrices and arbitrary nonzero entries of the signal. In this regard, it generalizes the notion of incoherence in the existing literature on model selection and introduces two fundamental measures of coherence---termed the worst-case coherence and the average coherence---among the columns of a design matrix. It utilizes these two measures of coherence to provide an in-depth analysis of a simple, model-order agnostic one-step thresholding (OST) algorithm for model selection and proves that OST is feasible for exact as well as partial model selection as long as the design matrix obeys an easily verifiable property. One of the key insights offered by the ensuing analysis in this regard is that OST can successfully carry out model selection even when methods based on convex optimization such as the lasso fail due to the rank deficiency of the submatrices of the design matrix. In addition, the paper establishes that if the design matrix has reasonably small worst-case and average coherence then OST performs near-optimally when either (i) the energy of any nonzero entry of the signal is close to the average signal energy per nonzero entry or (ii) the signal-to-noise ratio in the measurement system is not too high. Finally, two other key contributions of the paper are that (i) it provides bounds on the average coherence of Gaussian matrices and Gabor frames, and (ii) it extends the results on model selection using OST to low-complexity, model-order agnostic recovery of sparse signals with arbitrary nonzero entries.

Submitted 2 July, 2010; v1 submitted 3 June, 2010; originally announced June 2010.

Comments: 31 pages, 4 figures; This paper is a full-length journal version of a shorter paper that was presented at the IEEE International Symposium on Information Theory, Austin, TX, June 2010

Journal ref: J. Commun. Netw., vol. 12, no. 4, pp. 289-307, Aug. 2010

Showing 1–26 of 26 results for author: Bajwa, W U