Search | arXiv e-print repository

ATLAS: Learning to Optimally Memorize the Context at Test Time

Authors: Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni

Abstract: Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, bound their applicability in longer sequences and so has motivated researchers to explore effective alternative architectures such as modern recurrent neural net… ▽ More Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, bound their applicability in longer sequences and so has motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a long-term recurrent memory module). Despite their recent success in diverse downstream tasks, they struggle in tasks that requires long context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects in their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To enhance all these three aspects, we present ATLAS, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long context performance of Titans, achieving +80\% accuracy in 10M context length of BABILong benchmark. △ Less

Submitted 29 May, 2025; originally announced May 2025.

arXiv:2504.13173 [pdf, other]

It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization

Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni

Abstract: Designing efficient and effective architectural backbones has been in the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias-the natural tendency to prioritize certain events or stimuli-we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks as associati… ▽ More Designing efficient and effective architectural backbones has been in the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias-the natural tendency to prioritize certain events or stimuli-we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observed that most existing sequence models leverage either (1) dot-product similarity, or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with their effective approximations to stabilize their training procedure. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework to design deep learning architectures based on four choices of: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models-Moneta, Yaad, and Memora-that go beyond the power of existing linear RNNs while maintaining a fast parallelizable training process. Our experiments show different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance in special tasks such as language modeling, commonsense reasoning, and recall intensive tasks, even outperforming Transformers and other modern linear recurrent models. △ Less

Submitted 17 April, 2025; originally announced April 2025.

arXiv:2502.17607 [pdf, other]

Synthetic Text Generation for Training Large Language Models via Gradient Matching

Authors: Dang Nguyen, Zeman Li, Mohammadhossein Bateni, Vahab Mirrokni, Meisam Razaviyayn, Baharan Mirzasoleiman

Abstract: Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristics and cannot generate human-readable text without compromising the privacy of real data or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the f… ▽ More Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristics and cannot generate human-readable text without compromising the privacy of real data or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that guarantees the convergence and performance of LLMs during fine-tuning on a target task. To do so, we leverage Alternating Direction Method of Multipliers (ADMM) that iteratively optimizes the embeddings of synthetic examples to match the gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text can guarantee convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data. Experiments on various classification tasks confirm the effectiveness of our proposed approach. △ Less

Submitted 24 February, 2025; originally announced February 2025.

Comments: 15 pages, 5 figures, 4 tables

arXiv:2502.06244 [pdf, other]

PiKE: Adaptive Data Mixing for Multi-Task Learning Under Low Gradient Conflicts

Authors: Zeman Li, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni

Abstract: Modern machine learning models are trained on diverse datasets and tasks to improve generalization. A key challenge in multitask learning is determining the optimal data mixing and sampling strategy across different data sources. Prior research in this multi-task learning setting has primarily focused on mitigating gradient conflicts between tasks. However, we observe that many real-world multitas… ▽ More Modern machine learning models are trained on diverse datasets and tasks to improve generalization. A key challenge in multitask learning is determining the optimal data mixing and sampling strategy across different data sources. Prior research in this multi-task learning setting has primarily focused on mitigating gradient conflicts between tasks. However, we observe that many real-world multitask learning scenarios-such as multilingual training and multi-domain learning in large foundation models-exhibit predominantly positive task interactions with minimal or no gradient conflict. Building on this insight, we introduce PiKE (Positive gradient interaction-based K-task weights Estimator), an adaptive data mixing algorithm that dynamically adjusts task contributions throughout training. PiKE optimizes task sampling to minimize overall loss, effectively leveraging positive gradient interactions with almost no additional computational overhead. We establish theoretical convergence guarantees for PiKE and demonstrate its superiority over static and non-adaptive mixing strategies. Additionally, we extend PiKE to promote fair learning across tasks, ensuring balanced progress and preventing task underrepresentation. Empirical evaluations on large-scale language model pretraining show that PiKE consistently outperforms existing heuristic and static mixing strategies, leading to faster convergence and improved downstream task performance. △ Less

Submitted 10 February, 2025; originally announced February 2025.

arXiv:2501.19287 [pdf, other]

Differentially Private In-context Learning via Sampling Few-shot Mixed with Zero-shot Outputs

Authors: James Flemings, Haosheng Gan, Hongyi Li, Meisam Razaviyayn, Murali Annavaram

Abstract: In-context learning (ICL) has shown promising improvement in downstream task adaptation of LLMs by augmenting prompts with relevant input-output examples (demonstrations). However, the ICL demonstrations can contain privacy-sensitive information, which can be leaked and/or regurgitated by the LLM output. Differential Privacy (DP), a widely adopted privacy safeguard, has emerged to mitigate this pr… ▽ More In-context learning (ICL) has shown promising improvement in downstream task adaptation of LLMs by augmenting prompts with relevant input-output examples (demonstrations). However, the ICL demonstrations can contain privacy-sensitive information, which can be leaked and/or regurgitated by the LLM output. Differential Privacy (DP), a widely adopted privacy safeguard, has emerged to mitigate this privacy leakage, with recent work demonstrating strong privacy-utility tradeoffs in classification tasks for ICL. However, generation tasks for ICL are challenging due to the high-dimensional output space of open-ended generation. To this end, we propose $\texttt{dps-mozo}$, Differentially Private Sampling by Mixing One-shot with Zero-shot Outputs, a decoding framework that generates DP text by sampling from the product of multiple one-shot outputs mixed with a zero-shot output. This mixing effectively reduces the amount of information that can be leaked by each demonstration. By utilizing the inherent randomness in sampling from the mixed distributions, we can achieve DP without adding noise, thereby improving the privacy-utility tradeoff. Our experimental evaluations show $\texttt{dps-mozo}$ can achieve a strong privacy guarantee, $ε=2$, with minimal utility degradation compared to non-private few-shot learning, $\textbf{0.3}$% ROUGE-L F1 score decrease on the SAMSum dataset with Gemma 2 2B. △ Less

Submitted 31 January, 2025; originally announced January 2025.

arXiv:2412.18164 [pdf, ps, other]

Stochastic Control for Fine-tuning Diffusion Models: Optimality, Regularity, and Convergence

Authors: Yinbin Han, Meisam Razaviyayn, Renyuan Xu

Abstract: Diffusion models have emerged as powerful tools for generative modeling, demonstrating exceptional capability in capturing target data distributions from large datasets. However, fine-tuning these massive models for specific downstream tasks, constraints, and human preferences remains a critical challenge. While recent advances have leveraged reinforcement learning algorithms to tackle this proble… ▽ More Diffusion models have emerged as powerful tools for generative modeling, demonstrating exceptional capability in capturing target data distributions from large datasets. However, fine-tuning these massive models for specific downstream tasks, constraints, and human preferences remains a critical challenge. While recent advances have leveraged reinforcement learning algorithms to tackle this problem, much of the progress has been empirical, with limited theoretical understanding. To bridge this gap, we propose a stochastic control framework for fine-tuning diffusion models. Building on denoising diffusion probabilistic models as the pre-trained reference dynamics, our approach integrates linear dynamics control with Kullback-Leibler regularization. We establish the well-posedness and regularity of the stochastic control problem and develop a policy iteration algorithm (PI-FT) for numerical solution. We show that PI-FT achieves global convergence at a linear rate. Unlike existing work that assumes regularities throughout training, we prove that the control and value sequences generated by the algorithm maintain the regularity. Additionally, we explore extensions of our framework to parametric settings and continuous-time formulations. △ Less

Submitted 23 December, 2024; originally announced December 2024.

Comments: 28 pages

arXiv:2411.07889 [pdf, other]

A Stochastic Optimization Framework for Private and Fair Learning From Decentralized Data

Authors: Devansh Gupta, A. S. Poornash, Andrew Lowy, Meisam Razaviyayn

Abstract: Machine learning models are often trained on sensitive data (e.g., medical records and race/gender) that is distributed across different "silos" (e.g., hospitals). These federated learning models may then be used to make consequential decisions, such as allocating healthcare resources. Two key challenges emerge in this setting: (i) maintaining the privacy of each person's data, even if other silos… ▽ More Machine learning models are often trained on sensitive data (e.g., medical records and race/gender) that is distributed across different "silos" (e.g., hospitals). These federated learning models may then be used to make consequential decisions, such as allocating healthcare resources. Two key challenges emerge in this setting: (i) maintaining the privacy of each person's data, even if other silos or an adversary with access to the central server tries to infer this data; (ii) ensuring that decisions are fair to different demographic groups (e.g., race/gender). In this paper, we develop a novel algorithm for private and fair federated learning (FL). Our algorithm satisfies inter-silo record-level differential privacy (ISRL-DP), a strong notion of private FL requiring that silo i's sent messages satisfy record-level differential privacy for all i. Our framework can be used to promote different fairness notions, including demographic parity and equalized odds. We prove that our algorithm converges under mild smoothness assumptions on the loss function, whereas prior work required strong convexity for convergence. As a byproduct of our analysis, we obtain the first convergence guarantee for ISRL-DP nonconvex-strongly concave min-max FL. Experiments demonstrate the state-of-the-art fairness-accuracy tradeoffs of our algorithm across different privacy levels. △ Less

Submitted 12 November, 2024; originally announced November 2024.

arXiv:2410.06441 [pdf, other]

Addax: Utilizing Zeroth-Order Gradients to Improve Memory Efficiency and Performance of SGD for Fine-Tuning Language Models

Authors: Zeman Li, Xinwei Zhang, Peilin Zhong, Yuan Deng, Meisam Razaviyayn, Vahab Mirrokni

Abstract: Fine-tuning language models (LMs) with the Adam optimizer often demands excessive memory, limiting accessibility. The "in-place" version of Stochastic Gradient Descent (IP-SGD) and Memory-Efficient Zeroth-order Optimizer (MeZO) have been proposed to address this. However, IP-SGD still requires substantial memory, and MeZO suffers from slow convergence and degraded final performance due to its zero… ▽ More Fine-tuning language models (LMs) with the Adam optimizer often demands excessive memory, limiting accessibility. The "in-place" version of Stochastic Gradient Descent (IP-SGD) and Memory-Efficient Zeroth-order Optimizer (MeZO) have been proposed to address this. However, IP-SGD still requires substantial memory, and MeZO suffers from slow convergence and degraded final performance due to its zeroth-order nature. This paper introduces Addax, a novel method that improves both memory efficiency and performance of IP-SGD by integrating it with MeZO. Specifically, Addax computes zeroth- or first-order gradients of data points in the minibatch based on their memory consumption, combining these gradient estimates to update directions. By computing zeroth-order gradients for data points that require more memory and first-order gradients for others, Addax overcomes the slow convergence of MeZO and the excessive memory requirement of IP-SGD. Additionally, the zeroth-order gradient acts as a regularizer for the first-order gradient, further enhancing the model's final performance. Theoretically, we establish the convergence of Addax under mild assumptions, demonstrating faster convergence and less restrictive hyper-parameter choices than MeZO. Our experiments with diverse LMs and tasks show that Addax consistently outperforms MeZO regarding accuracy and convergence speed while having a comparable memory footprint. When fine-tuning OPT-13B with one A100 GPU, on average, Addax outperforms MeZO in accuracy/F1 score by 14% and runs 15x faster while using memory similar to MeZO. In our experiments on the larger OPT-30B model, on average, Addax outperforms MeZO in terms of accuracy/F1 score by >16 and runs 30x faster on a single H100 GPU. Moreover, Addax surpasses the performance of standard fine-tuning approaches, such as IP-SGD and Adam, in most tasks with significantly less memory requirement. △ Less

Submitted 8 October, 2024; originally announced October 2024.

arXiv:2410.03883 [pdf, other]

DiSK: Differentially Private Optimizer with Simplified Kalman Filter for Noise Reduction

Authors: Xinwei Zhang, Zhiqi Bu, Borja Balle, Mingyi Hong, Meisam Razaviyayn, Vahab Mirrokni

Abstract: Differential privacy (DP) offers a robust framework for safeguarding individual data privacy. To utilize DP in training modern machine learning models, differentially private optimizers have been widely used in recent years. A popular approach to privatize an optimizer is to clip the individual gradients and add sufficiently large noise to the clipped gradient. This approach led to the development… ▽ More Differential privacy (DP) offers a robust framework for safeguarding individual data privacy. To utilize DP in training modern machine learning models, differentially private optimizers have been widely used in recent years. A popular approach to privatize an optimizer is to clip the individual gradients and add sufficiently large noise to the clipped gradient. This approach led to the development of DP optimizers that have comparable performance with their non-private counterparts in fine-tuning tasks or in tasks with a small number of training parameters. However, a significant performance drop is observed when these optimizers are applied to large-scale training. This degradation stems from the substantial noise injection required to maintain DP, which disrupts the optimizer's dynamics. This paper introduces DiSK, a novel framework designed to significantly enhance the performance of DP optimizers. DiSK employs Kalman filtering, a technique drawn from control and signal processing, to effectively denoise privatized gradients and generate progressively refined gradient estimations. To ensure practicality for large-scale training, we simplify the Kalman filtering process, minimizing its memory and computational demands. We establish theoretical privacy-utility trade-off guarantees for DiSK, and demonstrate provable improvements over standard DP optimizers like DPSGD in terms of iteration complexity upper-bound. Extensive experiments across diverse tasks, including vision tasks such as CIFAR-100 and ImageNet-1k and language fine-tuning tasks such as GLUE, E2E, and DART, validate the effectiveness of DiSK. The results showcase its ability to significantly improve the performance of DP optimizers, surpassing state-of-the-art results under the same privacy constraints on several benchmarks. △ Less

Submitted 29 April, 2025; v1 submitted 4 October, 2024; originally announced October 2024.

arXiv:2410.02016 [pdf, other]

Adaptively Private Next-Token Prediction of Large Language Models

Authors: James Flemings, Meisam Razaviyayn, Murali Annavaram

Abstract: As Large Language Models (LLMs) proliferate, developing privacy safeguards for these models is crucial. One popular safeguard involves training LLMs in a differentially private manner. However, such solutions are shown to be computationally expensive and detrimental to the utility of these models. Since LLMs are deployed on the cloud and thus only accessible via an API, a Machine Learning as a Ser… ▽ More As Large Language Models (LLMs) proliferate, developing privacy safeguards for these models is crucial. One popular safeguard involves training LLMs in a differentially private manner. However, such solutions are shown to be computationally expensive and detrimental to the utility of these models. Since LLMs are deployed on the cloud and thus only accessible via an API, a Machine Learning as a Service (MLaaS) provider can protect its downstream data by privatizing the predictions during the decoding process. However, the practicality of such solutions still largely lags behind DP training methods. One recent promising approach, Private Mixing of Ensemble Distributions (PMixED), avoids additive noise by sampling from the output distributions of private LLMs mixed with the output distribution of a public model. Yet, PMixED must satisfy a fixed privacy level for a given number of queries, which is difficult for an analyst to estimate before inference and, hence, does not scale. To this end, we relax the requirements to a more practical setting by introducing Adaptive PMixED (AdaPMixED), a private decoding framework based on PMixED that is adaptive to the private and public output distributions evaluated on a given input query. In this setting, we introduce a noisy screening mechanism that filters out queries with potentially expensive privacy loss, and a data-dependent analysis that exploits the divergence of the private and public output distributions in its privacy loss calculation. Our experimental evaluations demonstrate that our mechanism and analysis can reduce the privacy loss by 16x while preserving the utility over the original PMixED. Furthermore, performing 100K predictions with AdaPMixED still achieves strong utility and a reasonable data-dependent privacy loss of 5.25. △ Less

Submitted 2 October, 2024; originally announced October 2024.

arXiv:2408.13460 [pdf, other]

DOPPLER: Differentially Private Optimizers with Low-pass Filter for Privacy Noise Reduction

Authors: Xinwei Zhang, Zhiqi Bu, Mingyi Hong, Meisam Razaviyayn

Abstract: Privacy is a growing concern in modern deep-learning systems and applications. Differentially private (DP) training prevents the leakage of sensitive information in the collected training data from the trained machine learning models. DP optimizers, including DP stochastic gradient descent (DPSGD) and its variants, privatize the training procedure by gradient clipping and DP noise injection. Howev… ▽ More Privacy is a growing concern in modern deep-learning systems and applications. Differentially private (DP) training prevents the leakage of sensitive information in the collected training data from the trained machine learning models. DP optimizers, including DP stochastic gradient descent (DPSGD) and its variants, privatize the training procedure by gradient clipping and DP noise injection. However, in practice, DP models trained using DPSGD and its variants often suffer from significant model performance degradation. Such degradation prevents the application of DP optimization in many key tasks, such as foundation model pretraining. In this paper, we provide a novel signal processing perspective to the design and analysis of DP optimizers. We show that a ``frequency domain'' operation called low-pass filtering can be used to effectively reduce the impact of DP noise. More specifically, by defining the ``frequency domain'' for both the gradient and differential privacy (DP) noise, we have developed a new component, called DOPPLER. This component is designed for DP algorithms and works by effectively amplifying the gradient while suppressing DP noise within this frequency domain. As a result, it maintains privacy guarantees and enhances the quality of the DP-protected model. Our experiments show that the proposed DP optimizers with a low-pass filter outperform their counterparts without the filter by 3%-10% in test accuracy on various models and datasets. Both theoretical and practical evidence suggest that the DOPPLER is effective in closing the gap between DP and non-DP training. △ Less

Submitted 24 August, 2024; originally announced August 2024.

arXiv:2403.15638 [pdf, other]

Differentially Private Next-Token Prediction of Large Language Models

Authors: James Flemings, Meisam Razaviyayn, Murali Annavaram

Abstract: Ensuring the privacy of Large Language Models (LLMs) is becoming increasingly important. The most widely adopted technique to accomplish this is DP-SGD, which trains a model to guarantee Differential Privacy (DP). However, DP-SGD overestimates an adversary's capabilities in having white box access to the model and, as a result, causes longer training times and larger memory usage than SGD. On the… ▽ More Ensuring the privacy of Large Language Models (LLMs) is becoming increasingly important. The most widely adopted technique to accomplish this is DP-SGD, which trains a model to guarantee Differential Privacy (DP). However, DP-SGD overestimates an adversary's capabilities in having white box access to the model and, as a result, causes longer training times and larger memory usage than SGD. On the other hand, commercial LLM deployments are predominantly cloud-based; hence, adversarial access to LLMs is black-box. Motivated by these observations, we present Private Mixing of Ensemble Distributions (PMixED): a private prediction protocol for next-token prediction that utilizes the inherent stochasticity of next-token sampling and a public model to achieve Differential Privacy. We formalize this by introducing RD-mollifers which project each of the model's output distribution from an ensemble of fine-tuned LLMs onto a set around a public LLM's output distribution, then average the projected distributions and sample from it. Unlike DP-SGD which needs to consider the model architecture during training, PMixED is model agnostic, which makes PMixED a very appealing solution for current deployments. Our results show that PMixED achieves a stronger privacy guarantee than sample-level privacy and outperforms DP-SGD for privacy $ε= 8$ on large-scale datasets. Thus, PMixED offers a practical alternative to DP training methods for achieving strong generative utility without compromising privacy. △ Less

Submitted 26 April, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

arXiv:2401.15604 [pdf, ps, other]

Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization

Authors: Yinbin Han, Meisam Razaviyayn, Renyuan Xu

Abstract: Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness. A key component of these models is to learn the score function through score matching. Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with a provable accuracy. As a first… ▽ More Diffusion models have emerged as a powerful tool rivaling GANs in generating high-quality samples with improved fidelity, flexibility, and robustness. A key component of these models is to learn the score function through score matching. Despite empirical success on various tasks, it remains unclear whether gradient-based algorithms can learn the score function with a provable accuracy. As a first step toward answering this question, this paper establishes a mathematical framework for analyzing score estimation using neural networks trained by gradient descent. Our analysis covers both the optimization and the generalization aspects of the learning procedure. In particular, we propose a parametric form to formulate the denoising score-matching problem as a regression with noisy labels. Compared to the standard supervised learning setup, the score-matching problem introduces distinct challenges, including unbounded input, vector-valued output, and an additional time variable, preventing existing techniques from being applied directly. In this paper, we show that with proper designs, the evolution of neural networks during training can be accurately modeled by a series of kernel regression tasks. Furthermore, by applying an early-stopping rule for gradient descent and leveraging recent developments in neural tangent kernels, we establish the first generalization error (sample complexity) bounds for learning the score function with neural networks, despite the presence of noise in the observations. Our analysis is grounded in a novel parametric form of the neural network and an innovative connection between score matching and regression analysis, facilitating the application of advanced statistical and optimization techniques. △ Less

Submitted 12 March, 2024; v1 submitted 28 January, 2024; originally announced January 2024.

Comments: 39 pages

arXiv:2312.03259 [pdf, other]

f-FERM: A Scalable Framework for Robust Fair Empirical Risk Minimization

Authors: Sina Baharlouei, Shivam Patel, Meisam Razaviyayn

Abstract: Training and deploying machine learning models that meet fairness criteria for protected groups are fundamental in modern artificial intelligence. While numerous constraints and regularization terms have been proposed in the literature to promote fairness in machine learning tasks, most of these methods are not amenable to stochastic optimization due to the complex and nonlinear structure of const… ▽ More Training and deploying machine learning models that meet fairness criteria for protected groups are fundamental in modern artificial intelligence. While numerous constraints and regularization terms have been proposed in the literature to promote fairness in machine learning tasks, most of these methods are not amenable to stochastic optimization due to the complex and nonlinear structure of constraints and regularizers. Here, the term "stochastic" refers to the ability of the algorithm to work with small mini-batches of data. Motivated by the limitation of existing literature, this paper presents a unified stochastic optimization framework for fair empirical risk minimization based on f-divergence measures (f-FERM). The proposed stochastic algorithm enjoys theoretical convergence guarantees. In addition, our experiments demonstrate the superiority of fairness-accuracy tradeoffs offered by f-FERM for almost all batch sizes (ranging from full-batch to batch size of one). Moreover, we show that our framework can be extended to the case where there is a distribution shift from training to the test data. Our extension is based on a distributionally robust optimization reformulation of f-FERM objective under $L_p$ norms as uncertainty sets. Again, in this distributionally robust setting, f-FERM not only enjoys theoretical convergence guarantees but also outperforms other baselines in the literature in the tasks involving distribution shifts. An efficient stochastic implementation of $f$-FERM is publicly available. △ Less

Submitted 7 April, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

Comments: 24 Pages,5 figures

Journal ref: ICLR 2024

arXiv:2312.02341 [pdf, other]

Incentive Systems for Fleets of New Mobility Services

Authors: Ali Ghafelebashi, Meisam Razaviyayn, Maged Dessouky

Abstract: Traffic congestion has become an inevitable challenge in large cities due to population increases and expansion of urban areas. Various approaches are introduced to mitigate traffic issues, encompassing from expanding the road infrastructure to employing demand management. Congestion pricing and incentive schemes are extensively studied for traffic control in traditional networks where each driver… ▽ More Traffic congestion has become an inevitable challenge in large cities due to population increases and expansion of urban areas. Various approaches are introduced to mitigate traffic issues, encompassing from expanding the road infrastructure to employing demand management. Congestion pricing and incentive schemes are extensively studied for traffic control in traditional networks where each driver is a network "player". In this setup, drivers' "selfish" behavior hinders the network from reaching a socially optimal state. In future mobility services, on the other hand, a large portion of drivers/vehicles may be controlled by a small number of companies/organizations. In such a system, offering incentives to organizations can potentially be much more effective in reducing traffic congestion rather than offering incentives directly to drivers. This paper studies the problem of offering incentives to organizations to change the behavior of their individual drivers (or individuals relying on the organization's services). We developed a model where incentives are offered to each organization based on the aggregated travel time loss across all drivers in that organization. Such an incentive offering mechanism requires solving a large-scale optimization problem to minimize the system-level travel time. We propose an efficient algorithm for solving this optimization problem. Numerous experiments on Los Angeles County traffic data reveal the ability of our method to reduce system-level travel time by up to 6.9%. Moreover, our experiments demonstrate that incentivizing organizations can be up to 8 times more efficient than incentivizing individual drivers in terms of incentivization monetary cost. △ Less

Submitted 4 December, 2023; originally announced December 2023.

Comments: 23 pages, 12 figures. arXiv admin note: text overlap with arXiv:2204.07306

arXiv:2309.11682 [pdf, other]

Dr. FERMI: A Stochastic Distributionally Robust Fair Empirical Risk Minimization Framework

Authors: Sina Baharlouei, Meisam Razaviyayn

Abstract: While training fair machine learning models has been studied extensively in recent years, most developed methods rely on the assumption that the training and test data have similar distributions. In the presence of distribution shifts, fair models may behave unfairly on test data. There have been some developments for fair learning robust to distribution shifts to address this shortcoming. However… ▽ More While training fair machine learning models has been studied extensively in recent years, most developed methods rely on the assumption that the training and test data have similar distributions. In the presence of distribution shifts, fair models may behave unfairly on test data. There have been some developments for fair learning robust to distribution shifts to address this shortcoming. However, most proposed solutions are based on the assumption of having access to the causal graph describing the interaction of different features. Moreover, existing algorithms require full access to data and cannot be used when small batches are used (stochastic/batch implementation). This paper proposes the first stochastic distributionally robust fairness framework with convergence guarantees that do not require knowledge of the causal graph. More specifically, we formulate the fair inference in the presence of the distribution shift as a distributionally robust optimization problem under $L_p$ norm uncertainty sets with respect to the Exponential Renyi Mutual Information (ERMI) as the measure of fairness violation. We then discuss how the proposed method can be implemented in a stochastic fashion. We have evaluated the presented framework's performance and efficiency through extensive experiments on real datasets consisting of distribution shifts. △ Less

Submitted 20 September, 2023; originally announced September 2023.

Comments: 22 pages, 3 figures

arXiv:2306.15056 [pdf, other]

Optimal Differentially Private Model Training with Public Data

Authors: Andrew Lowy, Zeman Li, Tianjian Huang, Meisam Razaviyayn

Abstract: Differential privacy (DP) ensures that training a machine learning model does not leak private data. In practice, we may have access to auxiliary public data that is free of privacy concerns. In this work, we assume access to a given amount of public data and settle the following fundamental open questions: 1. What is the optimal (worst-case) error of a DP model trained over a private data set whi… ▽ More Differential privacy (DP) ensures that training a machine learning model does not leak private data. In practice, we may have access to auxiliary public data that is free of privacy concerns. In this work, we assume access to a given amount of public data and settle the following fundamental open questions: 1. What is the optimal (worst-case) error of a DP model trained over a private data set while having access to side public data? 2. How can we harness public data to improve DP model training in practice? We consider these questions in both the local and central models of pure and approximate DP. To answer the first question, we prove tight (up to log factors) lower and upper bounds that characterize the optimal error rates of three fundamental problems: mean estimation, empirical risk minimization, and stochastic convex optimization. We show that the optimal error rates can be attained (up to log factors) by either discarding private data and training a public model, or treating public data like it is private and using an optimal DP algorithm. To address the second question, we develop novel algorithms that are "even more optimal" (i.e. better constants) than the asymptotically optimal approaches described above. For local DP mean estimation, our algorithm is optimal including constants. Empirically, our algorithms show benefits over the state-of-the-art. △ Less

Submitted 9 September, 2024; v1 submitted 26 June, 2023; originally announced June 2023.

Comments: ICML 2024

arXiv:2306.13753 [pdf, ps, other]

Four Axiomatic Characterizations of the Integrated Gradients Attribution Method

Authors: Daniel Lundstrom, Meisam Razaviyayn

Abstract: Deep neural networks have produced significant progress among machine learning models in terms of accuracy and functionality, but their inner workings are still largely unknown. Attribution methods seek to shine a light on these "black box" models by indicating how much each input contributed to a model's outputs. The Integrated Gradients (IG) method is a state of the art baseline attribution meth… ▽ More Deep neural networks have produced significant progress among machine learning models in terms of accuracy and functionality, but their inner workings are still largely unknown. Attribution methods seek to shine a light on these "black box" models by indicating how much each input contributed to a model's outputs. The Integrated Gradients (IG) method is a state of the art baseline attribution method in the axiomatic vein, meaning it is designed to conform to particular principles of attributions. We present four axiomatic characterizations of IG, establishing IG as the unique method to satisfy different sets of axioms among a class of attribution methods. △ Less

Submitted 23 June, 2023; originally announced June 2023.

arXiv:2305.03100 [pdf, ps, other]

Distributing Synergy Functions: Unifying Game-Theoretic Interaction Methods for Machine-Learning Explainability

Authors: Daniel Lundstrom, Meisam Razaviyayn

Abstract: Deep learning has revolutionized many areas of machine learning, from computer vision to natural language processing, but these high-performance models are generally "black box." Explaining such models would improve transparency and trust in AI-powered decision making and is necessary for understanding other practical needs such as robustness and fairness. A popular means of enhancing model transp… ▽ More Deep learning has revolutionized many areas of machine learning, from computer vision to natural language processing, but these high-performance models are generally "black box." Explaining such models would improve transparency and trust in AI-powered decision making and is necessary for understanding other practical needs such as robustness and fairness. A popular means of enhancing model transparency is to quantify how individual inputs contribute to model outputs (called attributions) and the magnitude of interactions between groups of inputs. A growing number of these methods import concepts and results from game theory to produce attributions and interactions. This work presents a unifying framework for game-theory-inspired attribution and $k^\text{th}$-order interaction methods. We show that, given modest assumptions, a unique full account of interactions between features, called synergies, is possible in the continuous input setting. We identify how various methods are characterized by their policy of distributing synergies. We also demonstrate that gradient-based methods are characterized by their actions on monomials, a type of synergy function, and introduce unique gradient-based methods. We show that the combination of various criteria uniquely defines the attribution/interaction methods. Thus, the community needs to identify goals and contexts when developing and employing attribution and interaction methods. △ Less

Submitted 17 May, 2023; v1 submitted 4 May, 2023; originally announced May 2023.

arXiv:2303.08431 [pdf, other]

Policy Gradient Converges to the Globally Optimal Policy for Nearly Linear-Quadratic Regulators

Authors: Yinbin Han, Meisam Razaviyayn, Renyuan Xu

Abstract: Nonlinear control systems with partial information to the decision maker are prevalent in a variety of applications. As a step toward studying such nonlinear systems, this work explores reinforcement learning methods for finding the optimal policy in the nearly linear-quadratic regulator systems. In particular, we consider a dynamic system that combines linear and nonlinear components, and is gove… ▽ More Nonlinear control systems with partial information to the decision maker are prevalent in a variety of applications. As a step toward studying such nonlinear systems, this work explores reinforcement learning methods for finding the optimal policy in the nearly linear-quadratic regulator systems. In particular, we consider a dynamic system that combines linear and nonlinear components, and is governed by a policy with the same structure. Assuming that the nonlinear component comprises kernels with small Lipschitz coefficients, we characterize the optimization landscape of the cost function. Although the cost function is nonconvex in general, we establish the local strong convexity and smoothness in the vicinity of the global optimizer. Additionally, we propose an initialization mechanism to leverage these properties. Building on the developments, we design a policy gradient algorithm that is guaranteed to converge to the globally optimal policy with a linear rate. △ Less

Submitted 9 April, 2025; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: 34 pages

arXiv:2210.14410 [pdf, other]

Improving Adversarial Robustness via Joint Classification and Multiple Explicit Detection Classes

Authors: Sina Baharlouei, Fatemeh Sheikholeslami, Meisam Razaviyayn, Zico Kolter

Abstract: This work concerns the development of deep networks that are certifiably robust to adversarial attacks. Joint robust classification-detection was recently introduced as a certified defense mechanism, where adversarial examples are either correctly classified or assigned to the "abstain" class. In this work, we show that such a provable framework can benefit by extension to networks with multiple e… ▽ More This work concerns the development of deep networks that are certifiably robust to adversarial attacks. Joint robust classification-detection was recently introduced as a certified defense mechanism, where adversarial examples are either correctly classified or assigned to the "abstain" class. In this work, we show that such a provable framework can benefit by extension to networks with multiple explicit abstain classes, where the adversarial examples are adaptively assigned to those. We show that naively adding multiple abstain classes can lead to "model degeneracy", then we propose a regularization approach and a training method to counter this degeneracy by promoting full use of the multiple abstain classes. Our experiments demonstrate that the proposed approach consistently achieves favorable standard vs. robust verified accuracy tradeoffs, outperforming state-of-the-art algorithms for various choices of number of abstain classes. △ Less

Submitted 10 May, 2023; v1 submitted 25 October, 2022; originally announced October 2022.

Comments: 20 pages, 6 figures

Journal ref: International Conference on Artificial Intelligence and Statistics, PMLR 2023

arXiv:2210.08781 [pdf, other]

Stochastic Differentially Private and Fair Learning

Authors: Andrew Lowy, Devansh Gupta, Meisam Razaviyayn

Abstract: Machine learning models are increasingly used in high-stakes decision-making systems. In such applications, a major concern is that these models sometimes discriminate against certain demographic groups such as individuals with certain race, gender, or age. Another major concern in these applications is the violation of the privacy of users. While fair learning algorithms have been developed to mi… ▽ More Machine learning models are increasingly used in high-stakes decision-making systems. In such applications, a major concern is that these models sometimes discriminate against certain demographic groups such as individuals with certain race, gender, or age. Another major concern in these applications is the violation of the privacy of users. While fair learning algorithms have been developed to mitigate discrimination issues, these algorithms can still leak sensitive information, such as individuals' health or financial records. Utilizing the notion of differential privacy (DP), prior works aimed at developing learning algorithms that are both private and fair. However, existing algorithms for DP fair learning are either not guaranteed to converge or require full batch of data in each iteration of the algorithm to converge. In this paper, we provide the first stochastic differentially private algorithm for fair learning that is guaranteed to converge. Here, the term "stochastic" refers to the fact that our proposed algorithm converges even when minibatches of data are used at each iteration (i.e. stochastic optimization). Our framework is flexible enough to permit different fairness notions, including demographic parity and equalized odds. In addition, our algorithm can be applied to non-binary classification tasks with multiple (non-binary) sensitive attributes. As a byproduct of our convergence analysis, we provide the first utility guarantee for a DP algorithm for solving nonconvex-strongly concave min-max problems. Our numerical experiments show that the proposed algorithm consistently offers significant performance gains over the state-of-the-art baselines, and can be applied to larger scale problems with non-binary target/sensitive attributes. △ Less

Submitted 3 June, 2023; v1 submitted 17 October, 2022; originally announced October 2022.

Comments: ICLR 2023

arXiv:2209.11920 [pdf, other]

Tradeoffs between convergence rate and noise amplification for momentum-based accelerated optimization algorithms

Authors: Hesameddin Mohammadi, Meisam Razaviyayn, Mihailo R. Jovanović

Abstract: We study momentum-based first-order optimization algorithms in which the iterations utilize information from the two previous steps and are subject to an additive white noise. This setup uses noise to account for uncertainty in either gradient evaluation or iteration updates, and it includes Polyak's heavy-ball and Nesterov's accelerated methods as special cases. For strongly convex quadratic prob… ▽ More We study momentum-based first-order optimization algorithms in which the iterations utilize information from the two previous steps and are subject to an additive white noise. This setup uses noise to account for uncertainty in either gradient evaluation or iteration updates, and it includes Polyak's heavy-ball and Nesterov's accelerated methods as special cases. For strongly convex quadratic problems, we use the steady-state variance of the error in the optimization variable to quantify noise amplification and identify fundamental stochastic performance tradeoffs. Our approach utilizes the Jury stability criterion to provide a novel geometric characterization of conditions for linear convergence, and it reveals the relation between the noise amplification and convergence rate as well as their dependence on the condition number and the constant algorithmic parameters. This geometric insight leads to simple alternative proofs of standard convergence results and allows us to establish ``uncertainty principle'' of strongly convex optimization: for the two-step momentum method with linear convergence rate, the lower bound on the product between the settling time and noise amplification scales quadratically with the condition number. Our analysis also identifies a key difference between the gradient and iterate noise models: while the amplification of gradient noise can be made arbitrarily small by sufficiently decelerating the algorithm, the best achievable variance for the iterate noise model increases linearly with the settling time in the decelerating regime. Finally, we introduce two parameterized families of algorithms that strike a balance between noise amplification and settling time while preserving order-wise Pareto optimality for both noise models. △ Less

Submitted 19 June, 2024; v1 submitted 24 September, 2022; originally announced September 2022.

Comments: 23 pages; 7 figures

arXiv:2209.07403 [pdf, other]

Private Stochastic Optimization With Large Worst-Case Lipschitz Parameter

Authors: Andrew Lowy, Meisam Razaviyayn

Abstract: We study differentially private (DP) stochastic optimization (SO) with loss functions whose worst-case Lipschitz parameter over all data may be extremely large or infinite. To date, the vast majority of work on DP SO assumes that the loss is uniformly Lipschitz continuous (i.e. stochastic gradients are uniformly bounded) over data. While this assumption is convenient, it often leads to pessimistic… ▽ More We study differentially private (DP) stochastic optimization (SO) with loss functions whose worst-case Lipschitz parameter over all data may be extremely large or infinite. To date, the vast majority of work on DP SO assumes that the loss is uniformly Lipschitz continuous (i.e. stochastic gradients are uniformly bounded) over data. While this assumption is convenient, it often leads to pessimistic risk bounds. In many practical problems, the worst-case (uniform) Lipschitz parameter of the loss over all data may be huge due to outliers and/or heavy-tailed data. In such cases, the risk bounds for DP SO, which scale with the worst-case Lipschitz parameter, are vacuous. To address these limitations, we provide improved risk bounds that do not depend on the uniform Lipschitz parameter. Following a recent line of work [WXDX20, KLZ22], we assume that stochastic gradients have bounded $k$-th order moments for some $k \geq 2$. Compared with works on uniformly Lipschitz DP SO, our risk bounds scale with the $k$-th moment instead of the uniform Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers and/or heavy-tailed data. For smooth convex loss functions, we provide linear-time algorithms with state-of-the-art excess risk. We complement our excess risk upper bounds with novel lower bounds. In certain parameter regimes, our linear-time excess risk bounds are minimax optimal. Second, we provide the first algorithm to handle non-smooth convex loss functions. To do so, we develop novel algorithmic and stability-based proof techniques, which we believe will be useful for future work in obtaining optimal excess risk. Finally, our work is the first to address non-convex non-uniformly Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some practical machine learning models. Our Proximal-PL algorithm has near-optimal excess risk. △ Less

Submitted 27 September, 2024; v1 submitted 15 September, 2022; originally announced September 2022.

Comments: To appear in Journal of Privacy and Confidentiality. A preliminary version appeared at International Conference on Algorithmic Learning Theory (ALT) 2023

arXiv:2204.07306 [pdf, other]

Congestion Reduction via Personalized Incentives

Authors: Ali Ghafelebashi, Meisam Razaviyayn, Maged Dessouky

Abstract: With rapid population growth and urban development, traffic congestion has become an inescapable issue, especially in large cities. Many congestion reduction strategies have been proposed in the past, ranging from roadway extension to transportation demand management. In particular, congestion pricing schemes have been used as negative reinforcements for traffic control. In this project, we study… ▽ More With rapid population growth and urban development, traffic congestion has become an inescapable issue, especially in large cities. Many congestion reduction strategies have been proposed in the past, ranging from roadway extension to transportation demand management. In particular, congestion pricing schemes have been used as negative reinforcements for traffic control. In this project, we study an alternative approach of offering positive incentives to drivers to take different routes. More specifically, we propose an algorithm to reduce traffic congestion and improve routing efficiency via offering personalized incentives to drivers. We exploit the wide-accessibility of smart devices to communicate with drivers and develop an incentive offering mechanism using individuals' preferences and aggregate traffic information. The incentives are offered after solving a large-scale optimization problem in order to minimize the total travel time (or minimize any cost function of the network such as total Carbon emission). Since this massive size optimization problem needs to be solved continually in the network, we developed a distributed computational approach. The proposed distributed algorithm is guaranteed to converge under a mild set of assumptions that are verified with real data. We evaluated the performance of our algorithm using traffic data from the Los Angeles area. Our experiments show congestion reduction of up to 11% in arterial roads and highways. △ Less

Submitted 14 April, 2022; originally announced April 2022.

Comments: 24 pages, 8 figures

arXiv:2203.06735 [pdf, other]

Private Non-Convex Federated Learning Without a Trusted Server

Authors: Andrew Lowy, Ali Ghafelebashi, Meisam Razaviyayn

Abstract: We study federated learning (FL) -- especially cross-silo FL -- with non-convex loss functions and data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) must protect the privacy of each person's data (e.g. patient's medical record), even if the server or other silos act as adversarial eavesdroppers. To that end, we consider inter-silo record-level… ▽ More We study federated learning (FL) -- especially cross-silo FL -- with non-convex loss functions and data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) must protect the privacy of each person's data (e.g. patient's medical record), even if the server or other silos act as adversarial eavesdroppers. To that end, we consider inter-silo record-level (ISRL) differential privacy (DP), which requires silo~$i$'s communications to satisfy record/item-level DP. We propose novel ISRL-DP algorithms for FL with heterogeneous (non-i.i.d.) silo data and two classes of Lipschitz continuous loss functions: First, we consider losses satisfying the Proximal Polyak-Lojasiewicz (PL) inequality, which is an extension of the classical PL condition to the constrained setting. In contrast to our result, prior works only considered unconstrained private optimization with Lipschitz PL loss, which rules out most interesting PL losses such as strongly convex problems and linear/logistic regression. Our algorithms nearly attain the optimal strongly convex, homogeneous (i.i.d.) rate for ISRL-DP FL without assuming convexity or i.i.d. data. Second, we give the first private algorithms for non-convex non-smooth loss functions. Our utility bounds even improve on the state-of-the-art bounds for smooth losses. We complement our upper bounds with lower bounds. Additionally, we provide shuffle DP (SDP) algorithms that improve over the state-of-the-art central DP algorithms under more practical trust assumptions. Numerical experiments show that our algorithm has better accuracy than baselines for most privacy levels. All the codes are publicly available at: https://github.com/ghafeleb/Private-NonConvex-Federated-Learning-Without-a-Trusted-Server. △ Less

Submitted 25 June, 2023; v1 submitted 13 March, 2022; originally announced March 2022.

Comments: AISTATS 2023

arXiv:2202.11912 [pdf, other]

A Rigorous Study of Integrated Gradients Method and Extensions to Internal Neuron Attributions

Authors: Daniel Lundstrom, Tianjian Huang, Meisam Razaviyayn

Abstract: As deep learning (DL) efficacy grows, concerns for poor model explainability grow also. Attribution methods address the issue of explainability by quantifying the importance of an input feature for a model prediction. Among various methods, Integrated Gradients (IG) sets itself apart by claiming other methods failed to satisfy desirable axioms, while IG and methods like it uniquely satisfy said ax… ▽ More As deep learning (DL) efficacy grows, concerns for poor model explainability grow also. Attribution methods address the issue of explainability by quantifying the importance of an input feature for a model prediction. Among various methods, Integrated Gradients (IG) sets itself apart by claiming other methods failed to satisfy desirable axioms, while IG and methods like it uniquely satisfy said axioms. This paper comments on fundamental aspects of IG and its applications/extensions: 1) We identify key differences between IG function spaces and the supporting literature's function spaces which problematize previous claims of IG uniqueness. We show that with the introduction of an additional axiom, \textit{non-decreasing positivity}, the uniqueness claims can be established. 2) We address the question of input sensitivity by identifying function classes where IG is/is not Lipschitz in the attributed input. 3) We show that axioms for single-baseline methods have analogous properties for methods with probability distribution baselines. 4) We introduce a computationally efficient method of identifying internal neurons that contribute to specified regions of an IG attribution map. Finally, we present experimental results validating this method. △ Less

Submitted 29 June, 2022; v1 submitted 24 February, 2022; originally announced February 2022.

Comments: arXiv admin note: text overlap with arXiv:1703.01365 by other authors

arXiv:2110.11205 [pdf, other]

Robustness through Data Augmentation Loss Consistency

Authors: Tianjian Huang, Shaunak Halbe, Chinnadhurai Sankar, Pooyan Amini, Satwik Kottur, Alborz Geramifard, Meisam Razaviyayn, Ahmad Beirami

Abstract: While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be… ▽ More While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. For example, dialog state covaries with named entity when we augment data with a new named entity. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks involving covariant data augmentation. Our code of all experiments is available at: https://github.com/optimization-for-data-driven-science/DAIR.git △ Less

Submitted 24 January, 2023; v1 submitted 21 October, 2021; originally announced October 2021.

Comments: 40 pages

arXiv:2110.03950 [pdf, ps, other]

Nonconvex-Nonconcave Min-Max Optimization with a Small Maximization Domain

Authors: Dmitrii M. Ostrovskii, Babak Barazandeh, Meisam Razaviyayn

Abstract: We study the problem of finding approximate first-order stationary points in optimization problems of the form $\min_{x \in X} \max_{y \in Y} f(x,y)$, where the sets $X,Y$ are convex and $Y$ is compact. The objective function $f$ is smooth, but assumed neither convex in $x$ nor concave in $y$. Our approach relies upon replacing the function $f(x,\cdot)$ with its $k$th order Taylor approximation (i… ▽ More We study the problem of finding approximate first-order stationary points in optimization problems of the form $\min_{x \in X} \max_{y \in Y} f(x,y)$, where the sets $X,Y$ are convex and $Y$ is compact. The objective function $f$ is smooth, but assumed neither convex in $x$ nor concave in $y$. Our approach relies upon replacing the function $f(x,\cdot)$ with its $k$th order Taylor approximation (in $y$) and finding a near-stationary point in the resulting surrogate problem. To guarantee its success, we establish the following result: let the Euclidean diameter of $Y$ be small in terms of the target accuracy $\varepsilon$, namely $O(\varepsilon^{\frac{2}{k+1}})$ for $k \in \mathbb{N}$ and $O(\varepsilon)$ for $k = 0$, with the constant factors controlled by certain regularity parameters of $f$; then any $\varepsilon$-stationary point in the surrogate problem remains $O(\varepsilon)$-stationary for the initial problem. Moreover, we show that these upper bounds are nearly optimal: the aforementioned reduction provably fails when the diameter of $Y$ is larger. For $0 \le k \le 2$ the surrogate function can be efficiently maximized in $y$; our general approximation result then leads to efficient algorithms for finding a near-stationary point in nonconvex-nonconcave min-max problems, for which we also provide convergence guarantees. △ Less

Submitted 8 October, 2021; originally announced October 2021.

Comments: 50 pages

arXiv:2109.00644 [pdf, other]

RIFLE: Imputation and Robust Inference from Low Order Marginals

Authors: Sina Baharlouei, Kelechi Ogudu, Sze-chuan Suen, Meisam Razaviyayn

Abstract: The ubiquity of missing values in real-world datasets poses a challenge for statistical inference and can prevent similar datasets from being analyzed in the same study, precluding many existing datasets from being used for new analyses. While an extensive collection of packages and algorithms have been developed for data imputation, the overwhelming majority perform poorly if there are many missi… ▽ More The ubiquity of missing values in real-world datasets poses a challenge for statistical inference and can prevent similar datasets from being analyzed in the same study, precluding many existing datasets from being used for new analyses. While an extensive collection of packages and algorithms have been developed for data imputation, the overwhelming majority perform poorly if there are many missing values and low sample sizes, which are unfortunately common characteristics in empirical data. Such low-accuracy estimations adversely affect the performance of downstream statistical models. We develop a statistical inference framework for regression and classification in the presence of missing data without imputation. Our framework, RIFLE (Robust InFerence via Low-order moment Estimations), estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model. We specialize our framework to linear regression and normal discriminant analysis, and we provide convergence and performance guarantees. This framework can also be adapted to impute missing data. In numerical experiments, we compare RIFLE to several state-of-the-art approaches (including MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer) for imputation and inference in the presence of missing values. Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small. RIFLE is publicly available at https://github.com/optimization-for-data-driven-science/RIFLE. △ Less

Submitted 12 September, 2023; v1 submitted 1 September, 2021; originally announced September 2021.

Comments: 36 pages, 11 figures

Journal ref: Transaction on Machine Learning Research (TMLR), 09/2023

arXiv:2106.09779 [pdf, other]

Private Federated Learning Without a Trusted Server: Optimal Algorithms for Convex Losses

Authors: Andrew Lowy, Meisam Razaviyayn

Abstract: This paper studies federated learning (FL)--especially cross-silo FL--with data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) has data from different people (e.g. patients) and must maintain the privacy of each person's data (e.g. medical record), even if the server or other silos act as adversarial eavesdroppers. This requirement motivates the… ▽ More This paper studies federated learning (FL)--especially cross-silo FL--with data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) has data from different people (e.g. patients) and must maintain the privacy of each person's data (e.g. medical record), even if the server or other silos act as adversarial eavesdroppers. This requirement motivates the study of Inter-Silo Record-Level Differential Privacy (ISRL-DP), which requires silos' communications to satisfy record/item-level differential privacy (DP). ISRL-DP ensures that the data of each person (e.g. patient) in silo i (e.g. hospital i) cannot be leaked. ISRL-DP is different from well-studied privacy notions. Central and user-level DP assume that people trust the server/other silos. On the other end of the spectrum, local DP assumes that people do not trust anyone at all (even their own silo). Sitting between central and local DP, ISRL-DP makes the realistic assumption (in cross-silo FL) that people trust their own silo, but not the server or other silos. In this work, we provide tight (up to logarithms) upper and lower bounds for ISRL-DP FL with convex/strongly convex loss functions and homogeneous (i.i.d.) silo data. Remarkably, we show that similar bounds are attainable for smooth losses with arbitrary heterogeneous silo data distributions, via an accelerated ISRL-DP algorithm. We also provide tight upper and lower bounds for ISRL-DP federated empirical risk minimization, and use acceleration to attain the optimal bounds in fewer rounds of communication than the state-of-the-art. Finally, with a secure "shuffler" to anonymize silo messages (but without a trusted server), our algorithm attains the optimal central DP rates under more practical trust assumptions. Numerical experiments show favorable privacy-accuracy tradeoffs for our algorithm in classification and regression tasks. △ Less

Submitted 24 November, 2024; v1 submitted 17 June, 2021; originally announced June 2021.

Comments: ICLR 2023

arXiv:2105.05953 [pdf, other]

Efficient Algorithms for Estimating the Parameters of Mixed Linear Regression Models

Authors: Babak Barazandeh, Ali Ghafelebashi, Meisam Razaviyayn, Ram Sriharsha

Abstract: Mixed linear regression (MLR) model is among the most exemplary statistical tools for modeling non-linear distributions using a mixture of linear models. When the additive noise in MLR model is Gaussian, Expectation-Maximization (EM) algorithm is a widely-used algorithm for maximum likelihood estimation of MLR parameters. However, when noise is non-Gaussian, the steps of EM algorithm may not have… ▽ More Mixed linear regression (MLR) model is among the most exemplary statistical tools for modeling non-linear distributions using a mixture of linear models. When the additive noise in MLR model is Gaussian, Expectation-Maximization (EM) algorithm is a widely-used algorithm for maximum likelihood estimation of MLR parameters. However, when noise is non-Gaussian, the steps of EM algorithm may not have closed-form update rules, which makes EM algorithm impractical. In this work, we study the maximum likelihood estimation of the parameters of MLR model when the additive noise has non-Gaussian distribution. In particular, we consider the case that noise has Laplacian distribution and we first show that unlike the the Gaussian case, the resulting sub-problems of EM algorithm in this case does not have closed-form update rule, thus preventing us from using EM in this case. To overcome this issue, we propose a new algorithm based on combining the alternating direction method of multipliers (ADMM) with EM algorithm idea. Our numerical experiments show that our method outperforms the EM algorithm in statistical accuracy and computational time in non-Gaussian noise case. △ Less

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2102.12586 [pdf, other]

A Stochastic Optimization Framework for Fair Risk Minimization

Authors: Andrew Lowy, Sina Baharlouei, Rakesh Pavan, Meisam Razaviyayn, Ahmad Beirami

Abstract: Despite the success of large-scale empirical risk minimization (ERM) at achieving high accuracy across a variety of machine learning tasks, fair ERM is hindered by the incompatibility of fairness constraints with stochastic optimization. We consider the problem of fair classification with discrete sensitive attributes and potentially large models and data sets, requiring stochastic solvers. Existi… ▽ More Despite the success of large-scale empirical risk minimization (ERM) at achieving high accuracy across a variety of machine learning tasks, fair ERM is hindered by the incompatibility of fairness constraints with stochastic optimization. We consider the problem of fair classification with discrete sensitive attributes and potentially large models and data sets, requiring stochastic solvers. Existing in-processing fairness algorithms are either impractical in the large-scale setting because they require large batches of data at each iteration or they are not guaranteed to converge. In this paper, we develop the first stochastic in-processing fairness algorithm with guaranteed convergence. For demographic parity, equalized odds, and equal opportunity notions of fairness, we provide slight variations of our algorithm--called FERMI--and prove that each of these variations converges in stochastic optimization with any batch size. Empirically, we show that FERMI is amenable to stochastic solvers with multiple (non-binary) sensitive attributes and non-binary targets, performing well even with minibatch size as small as one. Extensive experiments show that FERMI achieves the most favorable tradeoffs between fairness violation and test accuracy across all tested setups compared with state-of-the-art baselines for demographic parity, equalized odds, equal opportunity. These benefits are especially significant with small batch sizes and for non-binary classification with large number of sensitive attributes, making FERMI a practical, scalable fairness algorithm. The code for all of the experiments in this paper is available at: https://github.com/optimization-for-data-driven-science/FERMI. △ Less

Submitted 11 January, 2023; v1 submitted 24 February, 2021; originally announced February 2021.

Comments: 44 pages

Journal ref: Transactions on Machine Learning Research, 2022

arXiv:2102.04704 [pdf, ps, other]

Output Perturbation for Differentially Private Convex Optimization: Faster and More General

Authors: Andrew Lowy, Meisam Razaviyayn

Abstract: Finding efficient, easily implementable differentially private (DP) algorithms that offer strong excess risk bounds is an important problem in modern machine learning. To date, most work has focused on private empirical risk minimization (ERM) or private stochastic convex optimization (SCO), which corresponds to population loss minimization. However, there are often other objectives-such as fairne… ▽ More Finding efficient, easily implementable differentially private (DP) algorithms that offer strong excess risk bounds is an important problem in modern machine learning. To date, most work has focused on private empirical risk minimization (ERM) or private stochastic convex optimization (SCO), which corresponds to population loss minimization. However, there are often other objectives-such as fairness, adversarial robustness, or sensitivity to outliers-besides average performance that are not captured in the classical ERM/SCO setups. Further, most recent work in private SCO has focused on $(\varepsilon, δ)$-DP ($δ> 0$), whereas proving tight excess risk and runtime bounds for $(\varepsilon, 0)$-differential privacy remains a challenging open problem. Our first contribution is to provide the tightest known $(\varepsilon, 0)$-differentially private expected population loss bounds and fastest runtimes for smooth and strongly convex loss functions. In particular, for SCO with well-conditioned smooth and strongly convex loss functions, we provide a linear-time algorithm with optimal excess risk. For our second contribution, we study DP optimization for a broad class of tilted loss functions-which can be used to promote fairness or robustness, and are not necessarily of ERM form. We establish the first known DP excess risk and runtime bounds for optimizing this class; under smoothness and strong convexity assumptions, our bounds are near optimal. For our third contribution, we specialize our theory to DP adversarial training. Our results are achieved using perhaps the simplest yet practical differentially private algorithm: output perturbation. Although this method is not novel conceptually, our novel implementation scheme and analysis show that the power of this method to achieve strong privacy, utility, and runtime guarantees has not been fully appreciated in prior works. △ Less

Submitted 19 September, 2024; v1 submitted 9 February, 2021; originally announced February 2021.

arXiv:2012.02901 [pdf, other]

Near-Optimal Procedures for Model Discrimination with Non-Disclosure Properties

Authors: Dmitrii M. Ostrovskii, Mohamed Ndaoud, Adel Javanmard, Meisam Razaviyayn

Abstract: Let $θ_0,θ_1 \in \mathbb{R}^d$ be the population risk minimizers associated to some loss $\ell:\mathbb{R}^d\times \mathcal{Z}\to\mathbb{R}$ and two distributions $\mathbb{P}_0,\mathbb{P}_1$ on $\mathcal{Z}$. The models $θ_0,θ_1$ are unknown, and $\mathbb{P}_0,\mathbb{P}_1$ can be accessed by drawing i.i.d samples from them. Our work is motivated by the following model discrimination question: "Wha… ▽ More Let $θ_0,θ_1 \in \mathbb{R}^d$ be the population risk minimizers associated to some loss $\ell:\mathbb{R}^d\times \mathcal{Z}\to\mathbb{R}$ and two distributions $\mathbb{P}_0,\mathbb{P}_1$ on $\mathcal{Z}$. The models $θ_0,θ_1$ are unknown, and $\mathbb{P}_0,\mathbb{P}_1$ can be accessed by drawing i.i.d samples from them. Our work is motivated by the following model discrimination question: "What sizes of the samples from $\mathbb{P}_0$ and $\mathbb{P}_1$ allow to distinguish between the two hypotheses $θ^*=θ_0$ and $θ^*=θ_1$ for given $θ^*\in\{θ_0,θ_1\}$?" Making the first steps towards answering it in full generality, we first consider the case of a well-specified linear model with squared loss. Here we provide matching upper and lower bounds on the sample complexity as given by $\min\{1/Δ^2,\sqrt{r}/Δ\}$ up to a constant factor; here $Δ$ is a measure of separation between $\mathbb{P}_0$ and $\mathbb{P}_1$ and $r$ is the rank of the design covariance matrix. We then extend this result in two directions: (i) for general parametric models in asymptotic regime; (ii) for generalized linear models in small samples ($n\le r$) under weak moment assumptions. In both cases we derive sample complexity bounds of a similar form while allowing for model misspecification. In fact, our testing procedures only access $θ^*$ via a certain functional of empirical risk. In addition, the number of observations that allows us to reach statistical confidence does not allow to "resolve" the two models $-$ that is, recover $θ_0,θ_1$ up to $O(Δ)$ prediction accuracy. These two properties allow to use our framework in applied tasks where one would like to $\textit{identify}$ a prediction model, which can be proprietary, while guaranteeing that the model cannot be actually $\textit{inferred}$ by the identifying agent. △ Less

Submitted 10 July, 2021; v1 submitted 4 December, 2020; originally announced December 2020.

Comments: 52 pages, 2 figures; corrected the proof of the lower bound; added new applications and the Fisher information-based argument in Appendix F

arXiv:2009.03482 [pdf, ps, other]

Alternating Direction Method of Multipliers for Quantization

Authors: Tianjian Huang, Prajwal Singhania, Maziar Sanjabi, Pabitra Mitra, Meisam Razaviyayn

Abstract: Quantization of the parameters of machine learning models, such as deep neural networks, requires solving constrained optimization problems, where the constraint set is formed by the Cartesian product of many simple discrete sets. For such optimization problems, we study the performance of the Alternating Direction Method of Multipliers for Quantization ($\texttt{ADMM-Q}$) algorithm, which is a va… ▽ More Quantization of the parameters of machine learning models, such as deep neural networks, requires solving constrained optimization problems, where the constraint set is formed by the Cartesian product of many simple discrete sets. For such optimization problems, we study the performance of the Alternating Direction Method of Multipliers for Quantization ($\texttt{ADMM-Q}$) algorithm, which is a variant of the widely-used ADMM method applied to our discrete optimization problem. We establish the convergence of the iterates of $\texttt{ADMM-Q}$ to certain $\textit{stationary points}$. To the best of our knowledge, this is the first analysis of an ADMM-type method for problems with discrete variables/constraints. Based on our theoretical insights, we develop a few variants of $\texttt{ADMM-Q}$ that can handle inexact update rules, and have improved performance via the use of "soft projection" and "injecting randomness to the algorithm". We empirically evaluate the efficacy of our proposed approaches. △ Less

Submitted 1 March, 2021; v1 submitted 7 September, 2020; originally announced September 2020.

arXiv:2006.08141 [pdf, other]

doi 10.1109/MSP.2020.3003851

Non-convex Min-Max Optimization: Applications, Challenges, and Recent Theoretical Advances

Authors: Meisam Razaviyayn, Tianjian Huang, Songtao Lu, Maher Nouiehed, Maziar Sanjabi, Mingyi Hong

Abstract: The min-max optimization problem, also known as the saddle point problem, is a classical optimization problem which is also studied in the context of zero-sum games. Given a class of objective functions, the goal is to find a value for the argument which leads to a small objective value even for the worst case function in the given class. Min-max optimization problems have recently become very pop… ▽ More The min-max optimization problem, also known as the saddle point problem, is a classical optimization problem which is also studied in the context of zero-sum games. Given a class of objective functions, the goal is to find a value for the argument which leads to a small objective value even for the worst case function in the given class. Min-max optimization problems have recently become very popular in a wide range of signal and data processing applications such as fair beamforming, training generative adversarial networks (GANs), and robust machine learning, to just name a few. The overarching goal of this article is to provide a survey of recent advances for an important subclass of min-max problem, where the minimization and maximization problems can be non-convex and/or non-concave. In particular, we will first present a number of applications to showcase the importance of such min-max problems; then we discuss key theoretical challenges, and provide a selective review of some exciting recent theoretical and algorithmic advances in tackling non-convex min-max problems. Finally, we will point out open questions and future research directions. △ Less

Submitted 18 August, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

Journal ref: IEEE Signal Processing Magazine (Volume: 37, Issue: 5, Sept. 2020)

arXiv:2003.08093 [pdf, other]

Solving Non-Convex Non-Differentiable Min-Max Games using Proximal Gradient Method

Authors: Babak Barazandeh, Meisam Razaviyayn

Abstract: Min-max saddle point games appear in a wide range of applications in machine leaning and signal processing. Despite their wide applicability, theoretical studies are mostly limited to the special convex-concave structure. While some recent works generalized these results to special smooth non-convex cases, our understanding of non-smooth scenarios is still limited. In this work, we study special f… ▽ More Min-max saddle point games appear in a wide range of applications in machine leaning and signal processing. Despite their wide applicability, theoretical studies are mostly limited to the special convex-concave structure. While some recent works generalized these results to special smooth non-convex cases, our understanding of non-smooth scenarios is still limited. In this work, we study special form of non-smooth min-max games when the objective function is (strongly) convex with respect to one of the player's decision variable. We show that a simple multi-step proximal gradient descent-ascent algorithm converges to $ε$-first-order Nash equilibrium of the min-max game with the number of gradient evaluations being polynomial in $1/ε$. We will also show that our notion of stationarity is stronger than existing ones in the literature. Finally, we evaluate the performance of the proposed algorithm through adversarial attack on a LASSO estimator. △ Less

Submitted 18 March, 2020; originally announced March 2020.

arXiv:2002.07919 [pdf, other]

Efficient Search of First-Order Nash Equilibria in Nonconvex-Concave Smooth Min-Max Problems

Authors: Dmitrii M. Ostrovskii, Andrew Lowy, Meisam Razaviyayn

Abstract: We propose an efficient algorithm for finding first-order Nash equilibria in min-max problems of the form $\min_{x \in X}\max_{y\in Y} F(x,y)$, where the objective function is smooth in both variables and concave with respect to $y$; the sets $X$ and $Y$ are convex and "projection-friendly," and $Y$ is compact. Our goal is to find an $(\varepsilon_x,\varepsilon_y)$-first-order Nash equilibrium wit… ▽ More We propose an efficient algorithm for finding first-order Nash equilibria in min-max problems of the form $\min_{x \in X}\max_{y\in Y} F(x,y)$, where the objective function is smooth in both variables and concave with respect to $y$; the sets $X$ and $Y$ are convex and "projection-friendly," and $Y$ is compact. Our goal is to find an $(\varepsilon_x,\varepsilon_y)$-first-order Nash equilibrium with respect to a stationarity criterion that is stronger than the commonly used proximal gradient norm. The proposed approach is fairly simple: we perform approximate proximal-point iterations on the primal function, with inexact oracle provided by Nesterov's algorithm run on the regularized function $F(x_t,\cdot)$, $x_t$ being the current primal iterate. The resulting iteration complexity is $O(\varepsilon_x{}^{-2} \varepsilon_y{}^{-1/2})$ up to a logarithmic factor. As a byproduct, the choice $\varepsilon_y = O(\varepsilon_x{}^2)$ allows for the $O(\varepsilon_x{}^{-3})$ complexity of finding an $\varepsilon_x$-stationary point for the standard Moreau envelope of the primal function. Moreover, when the objective is strongly concave with respect to $y$, the complexity estimate for our algorithm improves to $O(\varepsilon_x{}^{-2}{κ_y}^{1/2})$ up to a logarithmic factor, where $κ_y$ is the condition number appropriately adjusted for coupling. In both scenarios, the complexity estimates are the best known so far, and are only known for the (weaker) proximal gradient norm criterion. Meanwhile, our approach is "user-friendly:" (i) the algorithm is built upon running a variant of Nesterov's accelerated algorithm as subroutine and avoids extragradient steps; (ii) the convergence analysis recycles the well-known results on accelerated methods with inexact oracle. Finally, we extend the approach to non-Euclidean proximal geometries. △ Less

Submitted 2 May, 2021; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: 29 pages; accepted to SIAM Journal on Optimization (as of May 2021)

MSC Class: 90C06; 90C25; 90C26; 91A99

arXiv:2001.07819 [pdf, other]

Zeroth-Order Algorithms for Nonconvex Minimax Problems with Improved Complexities

Authors: Zhongruo Wang, Krishnakumar Balasubramanian, Shiqian Ma, Meisam Razaviyayn

Abstract: In this paper, we study zeroth-order algorithms for minimax optimization problems that are nonconvex in one variable and strongly-concave in the other variable. Such minimax optimization problems have attracted significant attention lately due to their applications in modern machine learning tasks. We first consider a deterministic version of the problem. We design and analyze the Zeroth-Order Gra… ▽ More In this paper, we study zeroth-order algorithms for minimax optimization problems that are nonconvex in one variable and strongly-concave in the other variable. Such minimax optimization problems have attracted significant attention lately due to their applications in modern machine learning tasks. We first consider a deterministic version of the problem. We design and analyze the Zeroth-Order Gradient Descent Ascent (\texttt{ZO-GDA}) algorithm, and provide improved results compared to existing works, in terms of oracle complexity. We also propose the Zeroth-Order Gradient Descent Multi-Step Ascent (\texttt{ZO-GDMSA}) algorithm that significantly improves the oracle complexity of \texttt{ZO-GDA}. We then consider stochastic versions of \texttt{ZO-GDA} and \texttt{ZO-GDMSA}, to handle stochastic nonconvex minimax problems. For this case, we provide oracle complexity results under two assumptions on the stochastic gradient: (i) the uniformly bounded variance assumption, which is common in traditional stochastic optimization, and (ii) the Strong Growth Condition (SGC), which has been known to be satisfied by modern over-parametrized machine learning models. We establish that under the SGC assumption, the complexities of the stochastic algorithms match that of deterministic algorithms. Numerical experiments are presented to support our theoretical results. △ Less

Submitted 4 April, 2022; v1 submitted 21 January, 2020; originally announced January 2020.

Comments: To appear in the Journal of Global Optimization

arXiv:1911.09815 [pdf, ps, other]

When Does Non-Orthogonal Tensor Decomposition Have No Spurious Local Minima?

Authors: Maziar Sanjabi, Sina Baharlouei, Meisam Razaviyayn, Jason D. Lee

Abstract: We study the optimization problem for decomposing $d$ dimensional fourth-order Tensors with $k$ non-orthogonal components. We derive \textit{deterministic} conditions under which such a problem does not have spurious local minima. In particular, we show that if $κ= \frac{λ_{max}}{λ_{min}} < \frac{5}{4}$, and incoherence coefficient is of the order $O(\frac{1}{\sqrt{d}})$, then all the local minima… ▽ More We study the optimization problem for decomposing $d$ dimensional fourth-order Tensors with $k$ non-orthogonal components. We derive \textit{deterministic} conditions under which such a problem does not have spurious local minima. In particular, we show that if $κ= \frac{λ_{max}}{λ_{min}} < \frac{5}{4}$, and incoherence coefficient is of the order $O(\frac{1}{\sqrt{d}})$, then all the local minima are globally optimal. Using standard techniques, these conditions could be easily transformed into conditions that would hold with high probability in high dimensions when the components are generated randomly. Finally, we prove that the tensor power method with deflation and restarts could efficiently extract all the components within a tolerance level $O(κ\sqrt{kτ^3})$ that seems to be the noise floor of non-orthogonal tensor decomposition. △ Less

Submitted 21 November, 2019; originally announced November 2019.

arXiv:1907.04450 [pdf, ps, other]

SNAP: Finding Approximate Second-Order Stationary Solutions Efficiently for Non-convex Linearly Constrained Problems

Authors: Songtao Lu, Meisam Razaviyayn, Bo Yang, Kejun Huang, Mingyi Hong

Abstract: This paper proposes low-complexity algorithms for finding approximate second-order stationary points (SOSPs) of problems with smooth non-convex objective and linear constraints. While finding (approximate) SOSPs is computationally intractable, we first show that generic instances of the problem can be solved efficiently. More specifically, for a generic problem instance, certain strict complementa… ▽ More This paper proposes low-complexity algorithms for finding approximate second-order stationary points (SOSPs) of problems with smooth non-convex objective and linear constraints. While finding (approximate) SOSPs is computationally intractable, we first show that generic instances of the problem can be solved efficiently. More specifically, for a generic problem instance, certain strict complementarity (SC) condition holds for all Karush-Kuhn-Tucker (KKT) solutions (with probability one). The SC condition is then used to establish an equivalence relationship between two different notions of SOSPs, one of which is computationally easy to verify. Based on this particular notion of SOSP, we design an algorithm named the Successive Negative-curvature grAdient Projection (SNAP), which successively performs either conventional gradient projection or some negative curvature based projection steps to find SOSPs. SNAP and its first-order extension SNAP$^+$, require $\mathcal{O}(1/ε^{2.5})$ iterations to compute an $(ε, \sqrtε)$-SOSP, and their per-iteration computational complexities are polynomial in the number of constraints and problem dimension. To our knowledge, this is the first time that first-order algorithms with polynomial per-iteration complexity and global sublinear rate have been designed to find SOSPs of the important class of non-convex problems with linear constraints. △ Less

Submitted 9 July, 2019; originally announced July 2019.

arXiv:1906.12005 [pdf, other]

Rényi Fair Inference

Authors: Sina Baharlouei, Maher Nouiehed, Ahmad Beirami, Meisam Razaviyayn

Abstract: Machine learning algorithms have been increasingly deployed in critical automated decision-making systems that directly affect human lives. When these algorithms are only trained to minimize the training/test error, they could suffer from systematic discrimination against individuals based on their sensitive attributes such as gender or race. Recently, there has been a surge in machine learning so… ▽ More Machine learning algorithms have been increasingly deployed in critical automated decision-making systems that directly affect human lives. When these algorithms are only trained to minimize the training/test error, they could suffer from systematic discrimination against individuals based on their sensitive attributes such as gender or race. Recently, there has been a surge in machine learning society to develop algorithms for fair machine learning. In particular, many adversarial learning procedures have been proposed to impose fairness. Unfortunately, these algorithms either can only impose fairness up to first-order dependence between the variables, or they lack computational convergence guarantees. In this paper, we use Rényi correlation as a measure of fairness of machine learning models and develop a general training framework to impose fairness. In particular, we propose a min-max formulation which balances the accuracy and fairness when solved to optimality. For the case of discrete sensitive attributes, we suggest an iterative algorithm with theoretical convergence guarantee for solving the proposed min-max problem. Our algorithm and analysis are then specialized to fair classification and the fair clustering problem under disparate impact doctrine. Finally, the performance of the proposed Rényi fair inference framework is evaluated on Adult and Bank datasets. △ Less

Submitted 13 January, 2020; v1 submitted 27 June, 2019; originally announced June 2019.

Comments: 11 pages, 1 figure

Journal ref: International Conference on Learning Representation, 2020

arXiv:1905.11011 [pdf, other]

Robustness of accelerated first-order algorithms for strongly convex optimization problems

Authors: Hesameddin Mohammadi, Meisam Razaviyayn, Mihailo R. Jovanović

Abstract: We study the robustness of accelerated first-order algorithms to stochastic uncertainties in gradient evaluation. Specifically, for unconstrained, smooth, strongly convex optimization problems, we examine the mean-squared error in the optimization variable when the iterates are perturbed by additive white noise. This type of uncertainty may arise in situations where an approximation of the gradien… ▽ More We study the robustness of accelerated first-order algorithms to stochastic uncertainties in gradient evaluation. Specifically, for unconstrained, smooth, strongly convex optimization problems, we examine the mean-squared error in the optimization variable when the iterates are perturbed by additive white noise. This type of uncertainty may arise in situations where an approximation of the gradient is sought through measurements of a real system or in a distributed computation over a network. Even though the underlying dynamics of first-order algorithms for this class of problems are nonlinear, we establish upper bounds on the mean-squared deviation from the optimal solution that are tight up to constant factors. Our analysis quantifies fundamental trade-offs between noise amplification and convergence rates obtained via any acceleration scheme similar to Nesterov's or heavy-ball methods. To gain additional analytical insight, for strongly convex quadratic problems, we explicitly evaluate the steady-state variance of the optimization variable in terms of the eigenvalues of the Hessian of the objective function. We demonstrate that the entire spectrum of the Hessian, rather than just the extreme eigenvalues, influence robustness of noisy algorithms. We specialize this result to the problem of distributed averaging over undirected networks and examine the role of network size and topology on the robustness of noisy accelerated algorithms. △ Less

Submitted 20 February, 2020; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: 45 pages, 6 figures

arXiv:1904.09775 [pdf, other]

Training generative networks using random discriminators

Authors: Babak Barazandeh, Meisam Razaviyayn, Maziar Sanjabi

Abstract: In recent years, Generative Adversarial Networks (GANs) have drawn a lot of attentions for learning the underlying distribution of data in various applications. Despite their wide applicability, training GANs is notoriously difficult. This difficulty is due to the min-max nature of the resulting optimization problem and the lack of proper tools of solving general (non-convex, non-concave) min-max… ▽ More In recent years, Generative Adversarial Networks (GANs) have drawn a lot of attentions for learning the underlying distribution of data in various applications. Despite their wide applicability, training GANs is notoriously difficult. This difficulty is due to the min-max nature of the resulting optimization problem and the lack of proper tools of solving general (non-convex, non-concave) min-max optimization problems. In this paper, we try to alleviate this problem by proposing a new generative network that relies on the use of random discriminators instead of adversarial design. This design helps us to avoid the min-max formulation and leads to an optimization problem that is stable and could be solved efficiently. The performance of the proposed method is evaluated using handwritten digits (MNIST) and Fashion products (Fashion-MNIST) data sets. While the resulting images are not as sharp as adversarial training, the use of random discriminator leads to a much faster algorithm as compared to the adversarial counterpart. This observation, at the minimum, illustrates the potential of the random discriminator approach for warm-start in training GANs. △ Less

Submitted 22 April, 2019; originally announced April 2019.

arXiv:1904.06784 [pdf, ps, other]

A Trust Region Method for Finding Second-Order Stationarity in Linearly Constrained Non-Convex Optimization

Authors: Maher Nouiehed, Meisam Razaviyayn

Abstract: Motivated by TRACE algorithm [Curtis et al. 2017], we propose a trust region algorithm for finding second order stationary points of a linearly constrained non-convex optimization problem. We show the convergence of the proposed algorithm to (ε_g, ε_H)-second order stationary points in \widetilde{\mathcal{O}}(\max{ε_g^{-3/2}, ε_H^{-3}}) iterations. This iteration complexity is achieved for general… ▽ More Motivated by TRACE algorithm [Curtis et al. 2017], we propose a trust region algorithm for finding second order stationary points of a linearly constrained non-convex optimization problem. We show the convergence of the proposed algorithm to (ε_g, ε_H)-second order stationary points in \widetilde{\mathcal{O}}(\max{ε_g^{-3/2}, ε_H^{-3}}) iterations. This iteration complexity is achieved for general linearly constrained optimization without cubic regularization of the objective function. △ Less

Submitted 14 April, 2019; originally announced April 2019.

arXiv:1902.08297 [pdf, other]

Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods

Authors: Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, Meisam Razaviyayn

Abstract: Recent applications that arise in machine learning have surged significant interest in solving min-max saddle point games. This problem has been extensively studied in the convex-concave regime for which a global equilibrium solution can be computed efficiently. In this paper, we study the problem in the non-convex regime and show that an \varepsilon--first order stationary point of the game can b… ▽ More Recent applications that arise in machine learning have surged significant interest in solving min-max saddle point games. This problem has been extensively studied in the convex-concave regime for which a global equilibrium solution can be computed efficiently. In this paper, we study the problem in the non-convex regime and show that an \varepsilon--first order stationary point of the game can be computed when one of the player's objective can be optimized to global optimality efficiently. In particular, we first consider the case where the objective of one of the players satisfies the Polyak-Łojasiewicz (PL) condition. For such a game, we show that a simple multi-step gradient descent-ascent algorithm finds an \varepsilon--first order stationary point of the problem in \widetilde{\mathcal{O}}(\varepsilon^{-2}) iterations. Then we show that our framework can also be applied to the case where the objective of the "max-player" is concave. In this case, we propose a multi-step gradient descent-ascent algorithm that finds an \varepsilon--first order stationary point of the game in \widetilde{\cal O}(\varepsilon^{-3.5}) iterations, which is the best known rate in the literature. We applied our algorithm to a fair classification problem of Fashion-MNIST dataset and observed that the proposed algorithm results in smoother training and better generalization. △ Less

Submitted 30 October, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

arXiv:1812.08918 [pdf, other]

Computational RAM to Accelerate String Matching at Scale

Authors: Zamshed I. Chowdhury, S. Karen Khatamifard, Zhengyang Zhao, Masoud Zabihi, Salonik Resch, Meisam Razaviyayn, Jian-Ping Wang, Sachin Sapatnekar, Ulya R. Karpuzcu

Abstract: Traditional Von Neumann computing is falling apart in the era of exploding data volumes as the overhead of data transfer becomes forbidding. Instead, it is more energy-efficient to fuse compute capability with memory where the data reside. This is particularly critical for pattern matching, a key computational step in large-scale data analytics, which involves repetitive search over very large dat… ▽ More Traditional Von Neumann computing is falling apart in the era of exploding data volumes as the overhead of data transfer becomes forbidding. Instead, it is more energy-efficient to fuse compute capability with memory where the data reside. This is particularly critical for pattern matching, a key computational step in large-scale data analytics, which involves repetitive search over very large databases residing in memory. Emerging spintronic technologies show remarkable versatility for the tight integration of logic and memory. In this paper, we introduce CRAM-PM, a novel high-density, reconfigurable spintronic in-memory compute substrate for pattern matching. △ Less

Submitted 20 December, 2018; originally announced December 2018.

arXiv:1812.02878 [pdf, ps, other]

Solving Non-Convex Non-Concave Min-Max Games Under Polyak-Łojasiewicz Condition

Authors: Maziar Sanjabi, Meisam Razaviyayn, Jason D. Lee

Abstract: In this short note, we consider the problem of solving a min-max zero-sum game. This problem has been extensively studied in the convex-concave regime where the global solution can be computed efficiently. Recently, there have also been developments for finding the first order stationary points of the game when one of the player's objective is concave or (weakly) concave. This work focuses on the… ▽ More In this short note, we consider the problem of solving a min-max zero-sum game. This problem has been extensively studied in the convex-concave regime where the global solution can be computed efficiently. Recently, there have also been developments for finding the first order stationary points of the game when one of the player's objective is concave or (weakly) concave. This work focuses on the non-convex non-concave regime where the objective of one of the players satisfies Polyak-Łojasiewicz (PL) Condition. For such a game, we show that a simple multi-step gradient descent-ascent algorithm finds an $\varepsilon$--first order stationary point of the problem in $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ iterations. △ Less

Submitted 6 December, 2018; originally announced December 2018.

arXiv:1810.05251 [pdf, other]

A Linearly Convergent Doubly Stochastic Gauss-Seidel Algorithm for Solving Linear Equations and A Certain Class of Over-Parameterized Optimization Problems

Authors: Meisam Razaviyayn, Mingyi Hong, Navid Reyhanian, Zhi-Quan Luo

Abstract: Consider the classical problem of solving a general linear system of equations $Ax=b$. It is well known that the (successively over relaxed) Gauss-Seidel scheme and many of its variants may not converge when $A$ is neither diagonally dominant nor symmetric positive definite. Can we have a linearly convergent G-S type algorithm that works for {\it any} $A$? In this paper we answer this question aff… ▽ More Consider the classical problem of solving a general linear system of equations $Ax=b$. It is well known that the (successively over relaxed) Gauss-Seidel scheme and many of its variants may not converge when $A$ is neither diagonally dominant nor symmetric positive definite. Can we have a linearly convergent G-S type algorithm that works for {\it any} $A$? In this paper we answer this question affirmatively by proposing a doubly stochastic G-S algorithm that is provably linearly convergent (in the mean square error sense) for any feasible linear system of equations. The key in the algorithm design is to introduce a {\it nonuniform double stochastic} scheme for picking the equation and the variable in each update step as well as a stepsize rule. These techniques also generalize to certain iterative alternating projection algorithms for solving the linear feasibility problem $A x\le b$ with an arbitrary $A$, as well as high-dimensional minimization problems for training over-parameterized models in machine learning. Our results demonstrate that a carefully designed randomization scheme can make an otherwise divergent G-S algorithm converge. △ Less

Submitted 13 May, 2019; v1 submitted 11 October, 2018; originally announced October 2018.

Showing 1–50 of 76 results for author: Razaviyayn, M