-
One Attack to Rule Them All: Tight Quadratic Bounds for Adaptive Queries on Cardinality Sketches
Authors:
Edith Cohen,
Jelani Nelson,
Tamás Sarlós,
Mihir Singhal,
Uri Stemmer
Abstract:
Cardinality sketches are compact data structures for representing sets or vectors. These sketches are space-efficient, typically requiring only logarithmic storage in the input size, and enable approximation of cardinality (or the number of nonzero entries). A crucial property in applications is \emph{composability}, meaning that the sketch of a union of sets can be computed from individual sketch…
▽ More
Cardinality sketches are compact data structures for representing sets or vectors. These sketches are space-efficient, typically requiring only logarithmic storage in the input size, and enable approximation of cardinality (or the number of nonzero entries). A crucial property in applications is \emph{composability}, meaning that the sketch of a union of sets can be computed from individual sketches. Existing designs provide strong statistical guarantees, ensuring that a randomly sampled sketching map remains robust for an exponential number of queries in terms of the sketch size $k$. However, these guarantees degrade to quadratic in $k$ when queries are \emph{adaptive}, meaning they depend on previous responses.
Prior works on statistical queries (Steinke and Ullman, 2015) and specific MinHash cardinality sketches (Ahmadian and Cohen, 2024) established that this is tight in that they can be compromised using a quadratic number of adaptive queries. In this work, we develop a universal attack framework that applies to broad classes of cardinality sketches. We show that any union-composable sketching map can be compromised with $\tilde{O}(k^4)$ adaptive queries and this improves to a tight bound of $\tilde{O}(k^2)$ for monotone maps (including MinHash, statistical queries, and Boolean linear maps). Similarly, any linear sketching map over the reals $\mathbb{R}$ and finite fields $\mathbb{F}_p$ can be compromised using $\tilde{O}(k^2)$ adaptive queries, which is optimal and strengthens some of the recent results by~\citet{GribelyukLWYZ:FOCS2024}, who established a weaker polynomial bound.
△ Less
Submitted 13 March, 2025; v1 submitted 10 November, 2024;
originally announced November 2024.
-
Fast Tree-Field Integrators: From Low Displacement Rank to Topological Transformers
Authors:
Krzysztof Choromanski,
Arijit Sehanobish,
Somnath Basu Roy Chowdhury,
Han Lin,
Avinava Dubey,
Tamas Sarlos,
Snigdha Chaturvedi
Abstract:
We present a new class of fast polylog-linear algorithms based on the theory of structured matrices (in particular low displacement rank) for integrating tensor fields defined on weighted trees. Several applications of the resulting fast tree-field integrators (FTFIs) are presented, including (a) approximation of graph metrics with tree metrics, (b) graph classification, (c) modeling on meshes, an…
▽ More
We present a new class of fast polylog-linear algorithms based on the theory of structured matrices (in particular low displacement rank) for integrating tensor fields defined on weighted trees. Several applications of the resulting fast tree-field integrators (FTFIs) are presented, including (a) approximation of graph metrics with tree metrics, (b) graph classification, (c) modeling on meshes, and finally (d) Topological Transformers (TTs) (Choromanski et al., 2022) for images. For Topological Transformers, we propose new relative position encoding (RPE) masking mechanisms with as few as three extra learnable parameters per Transformer layer, leading to 1.0-1.5%+ accuracy gains. Importantly, most of FTFIs are exact methods, thus numerically equivalent to their brute-force counterparts. When applied to graphs with thousands of nodes, those exact algorithms provide 5.7-13x speedups. We also provide an extensive theoretical analysis of our methods.
△ Less
Submitted 6 December, 2024; v1 submitted 22 June, 2024;
originally announced June 2024.
-
Lower Bounds for Differential Privacy Under Continual Observation and Online Threshold Queries
Authors:
Edith Cohen,
Xin Lyu,
Jelani Nelson,
Tamás Sarlós,
Uri Stemmer
Abstract:
One of the most basic problems for studying the "price of privacy over time" is the so called private counter problem, introduced by Dwork et al. (2010) and Chan et al. (2010). In this problem, we aim to track the number of events that occur over time, while hiding the existence of every single event. More specifically, in every time step $t\in[T]$ we learn (in an online fashion) that $Δ_t\geq 0$…
▽ More
One of the most basic problems for studying the "price of privacy over time" is the so called private counter problem, introduced by Dwork et al. (2010) and Chan et al. (2010). In this problem, we aim to track the number of events that occur over time, while hiding the existence of every single event. More specifically, in every time step $t\in[T]$ we learn (in an online fashion) that $Δ_t\geq 0$ new events have occurred, and must respond with an estimate $n_t\approx\sum_{j=1}^t Δ_j$. The privacy requirement is that all of the outputs together, across all time steps, satisfy event level differential privacy. The main question here is how our error needs to depend on the total number of time steps $T$ and the total number of events $n$. Dwork et al. (2015) showed an upper bound of $O\left(\log(T)+\log^2(n)\right)$, and Henzinger et al. (2023) showed a lower bound of $Ω\left(\min\{\log n, \log T\}\right)$. We show a new lower bound of $Ω\left(\min\{n,\log T\}\right)$, which is tight w.r.t. the dependence on $T$, and is tight in the sparse case where $\log^2 n=O(\log T)$. Our lower bound has the following implications:
$\bullet$ We show that our lower bound extends to the "online thresholds problem", where the goal is to privately answer many "quantile queries" when these queries are presented one-by-one. This resolves an open question of Bun et al. (2017).
$\bullet$ Our lower bound implies, for the first time, a separation between the number of mistakes obtainable by a private online learner and a non-private online learner. This partially resolves a COLT'22 open question published by Sanyal and Ramponi.
$\bullet$ Our lower bound also yields the first separation between the standard model of private online learning and a recently proposed relaxed variant of it, called private online prediction.
△ Less
Submitted 17 April, 2024; v1 submitted 28 February, 2024;
originally announced March 2024.
-
Hot PATE: Private Aggregation of Distributions for Diverse Task
Authors:
Edith Cohen,
Benjamin Cohen-Wang,
Xin Lyu,
Jelani Nelson,
Tamas Sarlos,
Uri Stemmer
Abstract:
The Private Aggregation of Teacher Ensembles (PATE) framework enables privacy-preserving machine learning by aggregating responses from disjoint subsets of sensitive data. Adaptations of PATE to tasks with inherent output diversity such as text generation face a core tension: preserving output diversity reduces teacher agreement, which in turn increases the noise required for differential privacy,…
▽ More
The Private Aggregation of Teacher Ensembles (PATE) framework enables privacy-preserving machine learning by aggregating responses from disjoint subsets of sensitive data. Adaptations of PATE to tasks with inherent output diversity such as text generation face a core tension: preserving output diversity reduces teacher agreement, which in turn increases the noise required for differential privacy, degrading utility. Yet suppressing diversity is counterproductive, as modern large language models encapsulate knowledge in their output distributions.
We propose Hot PATE, a variant tailored to settings where outputs are distributions. We formally define what it means to preserve diversity and introduce an efficient aggregation mechanism that transfers diversity to the randomized output without incurring additional privacy cost. Our method can be implemented with only API access to proprietary models and serves as a drop-in replacement for existing "cold" PATE aggregators. Empirically, Hot PATE achieves orders-of-magnitude improvement on in-context learning tasks.
△ Less
Submitted 17 May, 2025; v1 submitted 4 December, 2023;
originally announced December 2023.
-
SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention
Authors:
Isabel Leal,
Krzysztof Choromanski,
Deepali Jain,
Avinava Dubey,
Jake Varley,
Michael Ryoo,
Yao Lu,
Frederick Liu,
Vikas Sindhwani,
Quan Vuong,
Tamas Sarlos,
Ken Oslund,
Karol Hausman,
Kanishka Rao
Abstract:
We present Self-Adaptive Robust Attention for Robotics Transformers (SARA-RT): a new paradigm for addressing the emerging challenge of scaling up Robotics Transformers (RT) for on-robot deployment. SARA-RT relies on the new method of fine-tuning proposed by us, called up-training. It converts pre-trained or already fine-tuned Transformer-based robotic policies of quadratic time complexity (includi…
▽ More
We present Self-Adaptive Robust Attention for Robotics Transformers (SARA-RT): a new paradigm for addressing the emerging challenge of scaling up Robotics Transformers (RT) for on-robot deployment. SARA-RT relies on the new method of fine-tuning proposed by us, called up-training. It converts pre-trained or already fine-tuned Transformer-based robotic policies of quadratic time complexity (including massive billion-parameter vision-language-action models or VLAs), into their efficient linear-attention counterparts maintaining high quality. We demonstrate the effectiveness of SARA-RT by speeding up: (a) the class of recently introduced RT-2 models, the first VLA robotic policies pre-trained on internet-scale data, as well as (b) Point Cloud Transformer (PCT) robotic policies operating on large point clouds. We complement our results with the rigorous mathematical analysis providing deeper insight into the phenomenon of SARA.
△ Less
Submitted 4 December, 2023;
originally announced December 2023.
-
Hardness of Low Rank Approximation of Entrywise Transformed Matrix Products
Authors:
Tamas Sarlos,
Xingyou Song,
David Woodruff,
Qiuyi,
Zhang
Abstract:
Inspired by fast algorithms in natural language processing, we study low rank approximation in the entrywise transformed setting where we want to find a good rank $k$ approximation to $f(U \cdot V)$, where $U, V^\top \in \mathbb{R}^{n \times r}$ are given, $r = O(\log(n))$, and $f(x)$ is a general scalar function. Previous work in sublinear low rank approximation has shown that if both (1)…
▽ More
Inspired by fast algorithms in natural language processing, we study low rank approximation in the entrywise transformed setting where we want to find a good rank $k$ approximation to $f(U \cdot V)$, where $U, V^\top \in \mathbb{R}^{n \times r}$ are given, $r = O(\log(n))$, and $f(x)$ is a general scalar function. Previous work in sublinear low rank approximation has shown that if both (1) $U = V^\top$ and (2) $f(x)$ is a PSD kernel function, then there is an $O(nk^{ω-1})$ time constant relative error approximation algorithm, where $ω\approx 2.376$ is the exponent of matrix multiplication. We give the first conditional time hardness results for this problem, demonstrating that both conditions (1) and (2) are in fact necessary for getting better than $n^{2-o(1)}$ time for a relative error low rank approximation for a wide class of functions. We give novel reductions from the Strong Exponential Time Hypothesis (SETH) that rely on lower bounding the leverage scores of flat sparse vectors and hold even when the rank of the transformed matrix $f(UV)$ and the target rank are $n^{o(1)}$, and when $U = V^\top$. Furthermore, even when $f(x) = x^p$ is a simple polynomial, we give runtime lower bounds in the case when $U \neq V^\top$ of the form $Ω(\min(n^{2-o(1)}, Ω(2^p)))$. Lastly, we demonstrate that our lower bounds are tight by giving an $O(n \cdot \text{poly}(k, 2^p, 1/ε))$ time relative error approximation algorithm and a fast $O(n \cdot \text{poly}(k, p, 1/ε))$ additive error approximation using fast tensor-based sketching. Additionally, since our low rank algorithms rely on matrix-vector product subroutines, our lower bounds extend to show that computing $f(UV)W$, for even a small matrix $W$, requires $Ω(n^{2-o(1)})$ time.
△ Less
Submitted 3 November, 2023;
originally announced November 2023.
-
Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers
Authors:
Krzysztof Marcin Choromanski,
Shanda Li,
Valerii Likhosherstov,
Kumar Avinava Dubey,
Shengjie Luo,
Di He,
Yiming Yang,
Tamas Sarlos,
Thomas Weingarten,
Adrian Weller
Abstract:
We propose a new class of linear Transformers called FourierLearner-Transformers (FLTs), which incorporate a wide range of relative positional encoding mechanisms (RPEs). These include regular RPE techniques applied for sequential data, as well as novel RPEs operating on geometric data embedded in higher-dimensional Euclidean spaces. FLTs construct the optimal RPE mechanism implicitly by learning…
▽ More
We propose a new class of linear Transformers called FourierLearner-Transformers (FLTs), which incorporate a wide range of relative positional encoding mechanisms (RPEs). These include regular RPE techniques applied for sequential data, as well as novel RPEs operating on geometric data embedded in higher-dimensional Euclidean spaces. FLTs construct the optimal RPE mechanism implicitly by learning its spectral representation. As opposed to other architectures combining efficient low-rank linear attention with RPEs, FLTs remain practical in terms of their memory usage and do not require additional assumptions about the structure of the RPE mask. Besides, FLTs allow for applying certain structural inductive bias techniques to specify masking strategies, e.g. they provide a way to learn the so-called local RPEs introduced in this paper and give accuracy gains as compared with several other linear Transformers for language modeling. We also thoroughly test FLTs on other data modalities and tasks, such as image classification, 3D molecular modeling, and learnable optimizers. To the best of our knowledge, for 3D molecular data, FLTs are the first Transformer architectures providing linear attention and incorporating RPE masking.
△ Less
Submitted 3 April, 2024; v1 submitted 3 February, 2023;
originally announced February 2023.
-
Efficient Graph Field Integrators Meet Point Clouds
Authors:
Krzysztof Choromanski,
Arijit Sehanobish,
Han Lin,
Yunfan Zhao,
Eli Berger,
Tetiana Parshakova,
Alvin Pan,
David Watkins,
Tianyi Zhang,
Valerii Likhosherstov,
Somnath Basu Roy Chowdhury,
Avinava Dubey,
Deepali Jain,
Tamas Sarlos,
Snigdha Chaturvedi,
Adrian Weller
Abstract:
We present two new classes of algorithms for efficient field integration on graphs encoding point clouds. The first class, SeparatorFactorization(SF), leverages the bounded genus of point cloud mesh graphs, while the second class, RFDiffusion(RFD), uses popular epsilon-nearest-neighbor graph representations for point clouds. Both can be viewed as providing the functionality of Fast Multipole Metho…
▽ More
We present two new classes of algorithms for efficient field integration on graphs encoding point clouds. The first class, SeparatorFactorization(SF), leverages the bounded genus of point cloud mesh graphs, while the second class, RFDiffusion(RFD), uses popular epsilon-nearest-neighbor graph representations for point clouds. Both can be viewed as providing the functionality of Fast Multipole Methods (FMMs), which have had a tremendous impact on efficient integration, but for non-Euclidean spaces. We focus on geometries induced by distributions of walk lengths between points (e.g., shortest-path distance). We provide an extensive theoretical analysis of our algorithms, obtaining new results in structural graph theory as a byproduct. We also perform exhaustive empirical evaluation, including on-surface interpolation for rigid and deformable objects (particularly for mesh-dynamics modeling), Wasserstein distance computations for point clouds, and the Gromov-Wasserstein variant.
△ Less
Submitted 4 October, 2023; v1 submitted 2 February, 2023;
originally announced February 2023.
-
FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features
Authors:
Valerii Likhosherstov,
Krzysztof Choromanski,
Avinava Dubey,
Frederick Liu,
Tamas Sarlos,
Adrian Weller
Abstract:
The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in important applications ranging from kernel methods to efficient Transformers. We propose parameterized, positive, non-trigonometric RFs which approximate Gaussian…
▽ More
The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in important applications ranging from kernel methods to efficient Transformers. We propose parameterized, positive, non-trigonometric RFs which approximate Gaussian and softmax-kernels. In contrast to traditional RF approximations, parameters of these new methods can be optimized to reduce the variance of the approximation, and the optimum can be expressed in closed form. We show that our methods lead to variance reduction in practice ($e^{10}$-times smaller variance and beyond) and outperform previous methods in a kernel regression task. Using our proposed mechanism, we also present FAVOR#, a method for self-attention approximation in Transformers. We show that FAVOR# outperforms other random feature methods in speech modelling and natural language processing.
△ Less
Submitted 1 February, 2023;
originally announced February 2023.
-
Generalized Private Selection and Testing with High Confidence
Authors:
Edith Cohen,
Xin Lyu,
Jelani Nelson,
Tamás Sarlós,
Uri Stemmer
Abstract:
Composition theorems are general and powerful tools that facilitate privacy accounting across multiple data accesses from per-access privacy bounds. However they often result in weaker bounds compared with end-to-end analysis. Two popular tools that mitigate that are the exponential mechanism (or report noisy max) and the sparse vector technique. They were generalized in a couple of recent private…
▽ More
Composition theorems are general and powerful tools that facilitate privacy accounting across multiple data accesses from per-access privacy bounds. However they often result in weaker bounds compared with end-to-end analysis. Two popular tools that mitigate that are the exponential mechanism (or report noisy max) and the sparse vector technique. They were generalized in a couple of recent private selection/test frameworks, including the work by Liu and Talwar (STOC 2019), and Papernot and Steinke (ICLR 2022).
In this work, we first present an alternative framework for private selection and testing with a simpler privacy proof and equally-good utility guarantee. Second, we observe that the private selection framework (both previous ones and ours) can be applied to improve the accuracy/confidence trade-off for many fundamental privacy-preserving data-analysis tasks, including query releasing, top-$k$ selection, and stable selection.
Finally, for online settings, we apply the private testing to design a mechanism for adaptive query releasing, which improves the sample complexity dependence on the confidence parameter for the celebrated private multiplicative weights algorithm of Hardt and Rothblum (FOCS 2010).
△ Less
Submitted 9 February, 2023; v1 submitted 22 November, 2022;
originally announced November 2022.
-
Õptimal Differentially Private Learning of Thresholds and Quasi-Concave Optimization
Authors:
Edith Cohen,
Xin Lyu,
Jelani Nelson,
Tamás Sarlós,
Uri Stemmer
Abstract:
The problem of learning threshold functions is a fundamental one in machine learning. Classical learning theory implies sample complexity of $O(ξ^{-1} \log(1/β))$ (for generalization error $ξ$ with confidence $1-β$). The private version of the problem, however, is more challenging and in particular, the sample complexity must depend on the size $|X|$ of the domain. Progress on quantifying this dep…
▽ More
The problem of learning threshold functions is a fundamental one in machine learning. Classical learning theory implies sample complexity of $O(ξ^{-1} \log(1/β))$ (for generalization error $ξ$ with confidence $1-β$). The private version of the problem, however, is more challenging and in particular, the sample complexity must depend on the size $|X|$ of the domain. Progress on quantifying this dependence, via lower and upper bounds, was made in a line of works over the past decade. In this paper, we finally close the gap for approximate-DP and provide a nearly tight upper bound of $\tilde{O}(\log^* |X|)$, which matches a lower bound by Alon et al (that applies even with improper learning) and improves over a prior upper bound of $\tilde{O}((\log^* |X|)^{1.5})$ by Kaplan et al. We also provide matching upper and lower bounds of $\tildeΘ(2^{\log^*|X|})$ for the additive error of private quasi-concave optimization (a related and more general problem). Our improvement is achieved via the novel Reorder-Slice-Compute paradigm for private data analysis which we believe will have further applications.
△ Less
Submitted 11 November, 2022;
originally announced November 2022.
-
Tricking the Hashing Trick: A Tight Lower Bound on the Robustness of CountSketch to Adaptive Inputs
Authors:
Edith Cohen,
Jelani Nelson,
Tamás Sarlós,
Uri Stemmer
Abstract:
CountSketch and Feature Hashing (the "hashing trick") are popular randomized dimensionality reduction methods that support recovery of $\ell_2$-heavy hitters (keys $i$ where $v_i^2 > ε\|\boldsymbol{v}\|_2^2$) and approximate inner products. When the inputs are {\em not adaptive} (do not depend on prior outputs), classic estimators applied to a sketch of size $O(\ell/ε)$ are accurate for a number o…
▽ More
CountSketch and Feature Hashing (the "hashing trick") are popular randomized dimensionality reduction methods that support recovery of $\ell_2$-heavy hitters (keys $i$ where $v_i^2 > ε\|\boldsymbol{v}\|_2^2$) and approximate inner products. When the inputs are {\em not adaptive} (do not depend on prior outputs), classic estimators applied to a sketch of size $O(\ell/ε)$ are accurate for a number of queries that is exponential in $\ell$. When inputs are adaptive, however, an adversarial input can be constructed after $O(\ell)$ queries with the classic estimator and the best known robust estimator only supports $\tilde{O}(\ell^2)$ queries. In this work we show that this quadratic dependence is in a sense inherent: We design an attack that after $O(\ell^2)$ queries produces an adversarial input vector whose sketch is highly biased. Our attack uses "natural" non-adaptive inputs (only the final adversarial input is chosen adaptively) and universally applies with any correct estimator, including one that is unknown to the attacker. In that, we expose inherent vulnerability of this fundamental method.
△ Less
Submitted 28 August, 2022; v1 submitted 3 July, 2022;
originally announced July 2022.
-
Chefs' Random Tables: Non-Trigonometric Random Features
Authors:
Valerii Likhosherstov,
Krzysztof Choromanski,
Avinava Dubey,
Frederick Liu,
Tamas Sarlos,
Adrian Weller
Abstract:
We introduce chefs' random tables (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on the trigonometric maps. We present variants of CRTs where RFs are positive, a key requirement for applications in recent low-rank Transformers. Further variance r…
▽ More
We introduce chefs' random tables (CRTs), a new class of non-trigonometric random features (RFs) to approximate Gaussian and softmax kernels. CRTs are an alternative to standard random kitchen sink (RKS) methods, which inherently rely on the trigonometric maps. We present variants of CRTs where RFs are positive, a key requirement for applications in recent low-rank Transformers. Further variance reduction is possible by leveraging statistics which are simple to compute. One instantiation of CRTs, the optimal positive random features (OPRFs), is to our knowledge the first RF method for unbiased softmax kernel estimation with positive and bounded RFs, resulting in exponentially small tails and much lower variance than its counterparts. As we show, orthogonal random features applied in OPRFs provide additional variance reduction for any dimensionality $d$ (not only asymptotically for sufficiently large $d$, as for RKS). We test CRTs on many tasks ranging from non-parametric classification to training Transformers for text, speech and image data, obtaining new state-of-the-art results for low-rank text Transformers, while providing linear space and time complexity.
△ Less
Submitted 30 May, 2022;
originally announced May 2022.
-
On the Robustness of CountSketch to Adaptive Inputs
Authors:
Edith Cohen,
Xin Lyu,
Jelani Nelson,
Tamás Sarlós,
Moshe Shechner,
Uri Stemmer
Abstract:
CountSketch is a popular dimensionality reduction technique that maps vectors to a lower dimension using randomized linear measurements. The sketch supports recovering $\ell_2$-heavy hitters of a vector (entries with $v[i]^2 \geq \frac{1}{k}\|\boldsymbol{v}\|^2_2$). We study the robustness of the sketch in adaptive settings where input vectors may depend on the output from prior inputs. Adaptive s…
▽ More
CountSketch is a popular dimensionality reduction technique that maps vectors to a lower dimension using randomized linear measurements. The sketch supports recovering $\ell_2$-heavy hitters of a vector (entries with $v[i]^2 \geq \frac{1}{k}\|\boldsymbol{v}\|^2_2$). We study the robustness of the sketch in adaptive settings where input vectors may depend on the output from prior inputs. Adaptive settings arise in processes with feedback or with adversarial attacks. We show that the classic estimator is not robust, and can be attacked with a number of queries of the order of the sketch size. We propose a robust estimator (for a slightly modified sketch) that allows for quadratic number of queries in the sketch size, which is an improvement factor of $\sqrt{k}$ (for $k$ heavy hitters) over prior work.
△ Less
Submitted 28 February, 2022;
originally announced February 2022.
-
From block-Toeplitz matrices to differential equations on graphs: towards a general theory for scalable masked Transformers
Authors:
Krzysztof Choromanski,
Han Lin,
Haoxian Chen,
Tianyi Zhang,
Arijit Sehanobish,
Valerii Likhosherstov,
Jack Parker-Holder,
Tamas Sarlos,
Adrian Weller,
Thomas Weingarten
Abstract:
In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformers architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However by casting the problem as a topo…
▽ More
In this paper we provide, to the best of our knowledge, the first comprehensive approach for incorporating various masking mechanisms into Transformers architectures in a scalable way. We show that recent results on linear causal attention (Choromanski et al., 2021) and log-linear RPE-attention (Luo et al., 2021) are special cases of this general mechanism. However by casting the problem as a topological (graph-based) modulation of unmasked attention, we obtain several results unknown before, including efficient d-dimensional RPE-masking and graph-kernel masking. We leverage many mathematical techniques ranging from spectral analysis through dynamic programming and random walks to new algorithms for solving Markov processes on graphs. We provide a corresponding empirical evaluation.
△ Less
Submitted 27 March, 2023; v1 submitted 16 July, 2021;
originally announced July 2021.
-
ES-ENAS: Efficient Evolutionary Optimization for Large Hybrid Search Spaces
Authors:
Xingyou Song,
Krzysztof Choromanski,
Jack Parker-Holder,
Yunhao Tang,
Qiuyi Zhang,
Daiyi Peng,
Deepali Jain,
Wenbo Gao,
Aldo Pacchiano,
Tamas Sarlos,
Yuxiang Yang
Abstract:
In this paper, we approach the problem of optimizing blackbox functions over large hybrid search spaces consisting of both combinatorial and continuous parameters. We demonstrate that previous evolutionary algorithms which rely on mutation-based approaches, while flexible over combinatorial spaces, suffer from a curse of dimensionality in high dimensional continuous spaces both theoretically and e…
▽ More
In this paper, we approach the problem of optimizing blackbox functions over large hybrid search spaces consisting of both combinatorial and continuous parameters. We demonstrate that previous evolutionary algorithms which rely on mutation-based approaches, while flexible over combinatorial spaces, suffer from a curse of dimensionality in high dimensional continuous spaces both theoretically and empirically, which thus limits their scope over hybrid search spaces as well. In order to combat this curse, we propose ES-ENAS, a simple and modular joint optimization procedure combining the class of sample-efficient smoothed gradient techniques, commonly known as Evolutionary Strategies (ES), with combinatorial optimizers in a highly scalable and intuitive way, inspired by the one-shot or supernet paradigm introduced in Efficient Neural Architecture Search (ENAS). By doing so, we achieve significantly more sample efficiency, which we empirically demonstrate over synthetic benchmarks, and are further able to apply ES-ENAS for architecture search over popular RL benchmarks.
△ Less
Submitted 15 March, 2023; v1 submitted 18 January, 2021;
originally announced January 2021.
-
Differentially Private Weighted Sampling
Authors:
Edith Cohen,
Ofir Geri,
Tamas Sarlos,
Uri Stemmer
Abstract:
Common datasets have the form of elements with keys (e.g., transactions and products) and the goal is to perform analytics on the aggregated form of key and frequency pairs. A weighted sample of keys by (a function of) frequency is a highly versatile summary that provides a sparse set of representative keys and supports approximate evaluations of query statistics. We propose private weighted sampl…
▽ More
Common datasets have the form of elements with keys (e.g., transactions and products) and the goal is to perform analytics on the aggregated form of key and frequency pairs. A weighted sample of keys by (a function of) frequency is a highly versatile summary that provides a sparse set of representative keys and supports approximate evaluations of query statistics. We propose private weighted sampling (PWS): A method that ensures element-level differential privacy while retaining, to the extent possible, the utility of a respective non-private weighted sample. PWS maximizes the reporting probabilities of keys and estimation quality of a broad family of statistics. PWS improves over the state of the art also for the well-studied special case of private histograms, when no sampling is performed. We empirically demonstrate significant performance gains compared with prior baselines: 20%-300% increase in key reporting for common Zipfian frequency distributions and accuracy for $\times 2$-$ 8$ lower frequencies in estimation tasks. Moreover, PWS is applied as a simple post-processing of a non-private sample, without requiring the original data. This allows for seamless integration with existing implementations of non-private schemes and retaining the efficiency of schemes designed for resource-constrained settings such as massive distributed or streamed data. We believe that due to practicality and performance, PWS may become a method of choice in applications where privacy is desired.
△ Less
Submitted 31 March, 2021; v1 submitted 25 October, 2020;
originally announced October 2020.
-
Rethinking Attention with Performers
Authors:
Krzysztof Choromanski,
Valerii Likhosherstov,
David Dohan,
Xingyou Song,
Andreea Gane,
Tamas Sarlos,
Peter Hawkins,
Jared Davis,
Afroz Mohiuddin,
Lukasz Kaiser,
David Belanger,
Lucy Colwell,
Adrian Weller
Abstract:
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random featu…
▽ More
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
△ Less
Submitted 19 November, 2022; v1 submitted 30 September, 2020;
originally announced September 2020.
-
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers
Authors:
Krzysztof Choromanski,
Valerii Likhosherstov,
David Dohan,
Xingyou Song,
Andreea Gane,
Tamas Sarlos,
Peter Hawkins,
Jared Davis,
David Belanger,
Lucy Colwell,
Adrian Weller
Abstract:
Transformer models have achieved state-of-the-art results across a diverse range of domains. However, concern over the cost of training the attention mechanism to learn complex dependencies between distant inputs continues to grow. In response, solutions that exploit the structure and sparsity of the learned attention matrix have blossomed. However, real-world applications that involve long sequen…
▽ More
Transformer models have achieved state-of-the-art results across a diverse range of domains. However, concern over the cost of training the attention mechanism to learn complex dependencies between distant inputs continues to grow. In response, solutions that exploit the structure and sparsity of the learned attention matrix have blossomed. However, real-world applications that involve long sequences, such as biological sequence analysis, may fall short of meeting these assumptions, precluding exploration of these models. To address this challenge, we present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR). Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors. Furthermore, it provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence. It is also backwards-compatible with pre-trained regular Transformers. We demonstrate its effectiveness on the challenging task of protein sequence modeling and provide detailed theoretical analysis.
△ Less
Submitted 30 September, 2020; v1 submitted 5 June, 2020;
originally announced June 2020.
-
Stochastic Flows and Geometric Optimization on the Orthogonal Group
Authors:
Krzysztof Choromanski,
David Cheikhi,
Jared Davis,
Valerii Likhosherstov,
Achille Nazaret,
Achraf Bahamou,
Xingyou Song,
Mrugank Akarte,
Jack Parker-Holder,
Jacob Bergquist,
Yuan Gao,
Aldo Pacchiano,
Tamas Sarlos,
Adrian Weller,
Vikas Sindhwani
Abstract:
We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group $O(d)$ and naturally reductive homogeneous manifolds obtained from the action of the rotation group $SO(d)$. We theoretically and experimentally demonstrate that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinf…
▽ More
We present a new class of stochastic, geometrically-driven optimization algorithms on the orthogonal group $O(d)$ and naturally reductive homogeneous manifolds obtained from the action of the rotation group $SO(d)$. We theoretically and experimentally demonstrate that our methods can be applied in various fields of machine learning including deep, convolutional and recurrent neural networks, reinforcement learning, normalizing flows and metric learning. We show an intriguing connection between efficient stochastic optimization on the orthogonal group and graph theory (e.g. matching problem, partition functions over graphs, graph-coloring). We leverage the theory of Lie groups and provide theoretical results for the designed class of algorithms. We demonstrate broad applicability of our methods by showing strong performance on the seemingly unrelated tasks of learning world models to obtain stable policies for the most difficult $\mathrm{Humanoid}$ agent from $\mathrm{OpenAI}$ $\mathrm{Gym}$ and improving convolutional neural networks.
△ Less
Submitted 30 March, 2020;
originally announced March 2020.
-
Reinforcement Learning with Chromatic Networks for Compact Architecture Search
Authors:
Xingyou Song,
Krzysztof Choromanski,
Jack Parker-Holder,
Yunhao Tang,
Wenbo Gao,
Aldo Pacchiano,
Tamas Sarlos,
Deepali Jain,
Yuxiang Yang
Abstract:
We present a neural architecture search algorithm to construct compact reinforcement learning (RL) policies, by combining ENAS and ES in a highly scalable and intuitive way. By defining the combinatorial search space of NAS to be the set of different edge-partitionings (colorings) into same-weight classes, we represent compact architectures via efficient learned edge-partitionings. For several RL…
▽ More
We present a neural architecture search algorithm to construct compact reinforcement learning (RL) policies, by combining ENAS and ES in a highly scalable and intuitive way. By defining the combinatorial search space of NAS to be the set of different edge-partitionings (colorings) into same-weight classes, we represent compact architectures via efficient learned edge-partitionings. For several RL tasks, we manage to learn colorings translating to effective policies parameterized by as few as $17$ weight parameters, providing >90% compression over vanilla policies and 6x compression over state-of-the-art compact policies based on Toeplitz matrices, while still maintaining good reward. We believe that our work is one of the first attempts to propose a rigorous approach to training structured neural network architectures for RL problems that are of interest especially in mobile robotics with limited storage and computational resources.
△ Less
Submitted 6 April, 2021; v1 submitted 10 July, 2019;
originally announced July 2019.
-
Matrix-Free Preconditioning in Online Learning
Authors:
Ashok Cutkosky,
Tamas Sarlos
Abstract:
We provide an online convex optimization algorithm with regret that interpolates between the regret of an algorithm using an optimal preconditioning matrix and one using a diagonal preconditioning matrix. Our regret bound is never worse than that obtained by diagonal preconditioning, and in certain setting even surpasses that of algorithms with full-matrix preconditioning. Importantly, our algorit…
▽ More
We provide an online convex optimization algorithm with regret that interpolates between the regret of an algorithm using an optimal preconditioning matrix and one using a diagonal preconditioning matrix. Our regret bound is never worse than that obtained by diagonal preconditioning, and in certain setting even surpasses that of algorithms with full-matrix preconditioning. Importantly, our algorithm runs in the same time and space complexity as online gradient descent. Along the way we incorporate new techniques that mildly streamline and improve logarithmic factors in prior regret analyses. We conclude by benchmarking our algorithm on synthetic data and deep learning tasks.
△ Less
Submitted 29 May, 2019;
originally announced May 2019.
-
Orthogonal Estimation of Wasserstein Distances
Authors:
Mark Rowland,
Jiri Hron,
Yunhao Tang,
Krzysztof Choromanski,
Tamas Sarlos,
Adrian Weller
Abstract:
Wasserstein distances are increasingly used in a wide variety of applications in machine learning. Sliced Wasserstein distances form an important subclass which may be estimated efficiently through one-dimensional sorting operations. In this paper, we propose a new variant of sliced Wasserstein distance, study the use of orthogonal coupling in Monte Carlo estimation of Wasserstein distances and dr…
▽ More
Wasserstein distances are increasingly used in a wide variety of applications in machine learning. Sliced Wasserstein distances form an important subclass which may be estimated efficiently through one-dimensional sorting operations. In this paper, we propose a new variant of sliced Wasserstein distance, study the use of orthogonal coupling in Monte Carlo estimation of Wasserstein distances and draw connections with stratified sampling, and evaluate our approaches experimentally in a range of large-scale experiments in generative modelling and reinforcement learning.
△ Less
Submitted 5 April, 2019; v1 submitted 9 March, 2019;
originally announced March 2019.
-
Linear Additive Markov Processes
Authors:
Ravi Kumar,
Maithra Raghu,
Tamas Sarlos,
Andrew Tomkins
Abstract:
We introduce LAMP: the Linear Additive Markov Process. Transitions in LAMP may be influenced by states visited in the distant history of the process, but unlike higher-order Markov processes, LAMP retains an efficient parametrization. LAMP also allows the specific dependence on history to be learned efficiently from data. We characterize some theoretical properties of LAMP, including its steady-st…
▽ More
We introduce LAMP: the Linear Additive Markov Process. Transitions in LAMP may be influenced by states visited in the distant history of the process, but unlike higher-order Markov processes, LAMP retains an efficient parametrization. LAMP also allows the specific dependence on history to be learned efficiently from data. We characterize some theoretical properties of LAMP, including its steady-state and mixing time. We then give an algorithm based on alternating minimization to learn LAMP models from data. Finally, we perform a series of real-world experiments to show that LAMP is more powerful than first-order Markov processes, and even holds its own against deep sequential models (LSTMs) with a negligible increase in parameter complexity.
△ Less
Submitted 4 April, 2017;
originally announced April 2017.
-
Structured adaptive and random spinners for fast machine learning computations
Authors:
Mariusz Bojarski,
Anna Choromanska,
Krzysztof Choromanski,
Francois Fagan,
Cedric Gouy-Pailler,
Anne Morvan,
Nourhan Sakr,
Tamas Sarlos,
Jamal Atif
Abstract:
We consider an efficient computational framework for speeding up several machine learning algorithms with almost no loss of accuracy. The proposed framework relies on projections via structured matrices that we call Structured Spinners, which are formed as products of three structured matrix-blocks that incorporate rotations. The approach is highly generic, i.e. i) structured matrices under consid…
▽ More
We consider an efficient computational framework for speeding up several machine learning algorithms with almost no loss of accuracy. The proposed framework relies on projections via structured matrices that we call Structured Spinners, which are formed as products of three structured matrix-blocks that incorporate rotations. The approach is highly generic, i.e. i) structured matrices under consideration can either be fully-randomized or learned, ii) our structured family contains as special cases all previously considered structured schemes, iii) the setting extends to the non-linear case where the projections are followed by non-linear functions, and iv) the method finds numerous applications including kernel approximations via random feature maps, dimensionality reduction algorithms, new fast cross-polytope LSH techniques, deep learning, convex optimization algorithms via Newton sketches, quantization with random projection trees, and more. The proposed framework comes with theoretical guarantees characterizing the capacity of the structured model in reference to its unstructured counterpart and is based on a general theoretical principle that we describe in the paper. As a consequence of our theoretical analysis, we provide the first theoretical guarantees for one of the most efficient existing LSH algorithms based on the HD3HD2HD1 structured matrix [Andoni et al., 2015]. The exhaustive experimental evaluation confirms the accuracy and efficiency of structured spinners for a variety of different applications.
△ Less
Submitted 26 November, 2016; v1 submitted 19 October, 2016;
originally announced October 2016.
-
TripleSpin - a generic compact paradigm for fast machine learning computations
Authors:
Krzysztof Choromanski,
Francois Fagan,
Cedric Gouy-Pailler,
Anne Morvan,
Tamas Sarlos,
Jamal Atif
Abstract:
We present a generic compact computational framework relying on structured random matrices that can be applied to speed up several machine learning algorithms with almost no loss of accuracy. The applications include new fast LSH-based algorithms, efficient kernel computations via random feature maps, convex optimization algorithms, quantization techniques and many more. Certain models of the pres…
▽ More
We present a generic compact computational framework relying on structured random matrices that can be applied to speed up several machine learning algorithms with almost no loss of accuracy. The applications include new fast LSH-based algorithms, efficient kernel computations via random feature maps, convex optimization algorithms, quantization techniques and many more. Certain models of the presented paradigm are even more compressible since they apply only bit matrices. This makes them suitable for deploying on mobile devices. All our findings come with strong theoretical guarantees. In particular, as a byproduct of the presented techniques and by using relatively new Berry-Esseen-type CLT for random vectors, we give the first theoretical guarantees for one of the most efficient existing LSH algorithms based on the $\textbf{HD}_{3}\textbf{HD}_{2}\textbf{HD}_{1}$ structured matrix ("Practical and Optimal LSH for Angular Distance"). These guarantees as well as theoretical results for other aforementioned applications follow from the same general theoretical principle that we present in the paper. Our structured family contains as special cases all previously considered structured schemes, including the recently introduced $P$-model. Experimental evaluation confirms the accuracy and efficiency of TripleSpin matrices.
△ Less
Submitted 6 June, 2016; v1 submitted 29 May, 2016;
originally announced May 2016.
-
Fastfood: Approximate Kernel Expansions in Loglinear Time
Authors:
Quoc Viet Le,
Tamas Sarlos,
Alexander Johannes Smola
Abstract:
Despite their successes, what makes kernel methods difficult to use in many large scale problems is the fact that storing and computing the decision function is typically expensive, especially at prediction time. In this paper, we overcome this difficulty by proposing Fastfood, an approximation that accelerates such computation significantly. Key to Fastfood is the observation that Hadamard matric…
▽ More
Despite their successes, what makes kernel methods difficult to use in many large scale problems is the fact that storing and computing the decision function is typically expensive, especially at prediction time. In this paper, we overcome this difficulty by proposing Fastfood, an approximation that accelerates such computation significantly. Key to Fastfood is the observation that Hadamard matrices, when combined with diagonal Gaussian matrices, exhibit properties similar to dense Gaussian random matrices. Yet unlike the latter, Hadamard and diagonal matrices are inexpensive to multiply and store. These two matrices can be used in lieu of Gaussian matrices in Random Kitchen Sinks proposed by Rahimi and Recht (2009) and thereby speeding up the computation for a large range of kernel functions. Specifically, Fastfood requires O(n log d) time and O(n) storage to compute n non-linear basis functions in d dimensions, a significant improvement from O(nd) computation and storage, without sacrificing accuracy.
Our method applies to any translation invariant and any dot-product kernel, such as the popular RBF kernels and polynomial kernels. We prove that the approximation is unbiased and has low variance. Experiments show that we achieve similar accuracy to full kernel expansions and Random Kitchen Sinks while being 100x faster and using 1000x less memory. These improvements, especially in terms of memory usage, make kernel methods more practical for applications that have large training sets and/or require real-time prediction.
△ Less
Submitted 13 August, 2014;
originally announced August 2014.
-
A Sparse Johnson--Lindenstrauss Transform
Authors:
Anirban Dasgupta,
Ravi Kumar,
Tamás Sarlós
Abstract:
Dimension reduction is a key algorithmic tool with many applications including nearest-neighbor search, compressed sensing and linear algebra in the streaming model. In this work we obtain a {\em sparse} version of the fundamental tool in dimension reduction --- the Johnson--Lindenstrauss transform. Using hashing and local densification, we construct a sparse projection matrix with just…
▽ More
Dimension reduction is a key algorithmic tool with many applications including nearest-neighbor search, compressed sensing and linear algebra in the streaming model. In this work we obtain a {\em sparse} version of the fundamental tool in dimension reduction --- the Johnson--Lindenstrauss transform. Using hashing and local densification, we construct a sparse projection matrix with just $\tilde{O}(\frac{1}ε)$ non-zero entries per column. We also show a matching lower bound on the sparsity for a large class of projection matrices. Our bounds are somewhat surprising, given the known lower bounds of $Ω(\frac{1}{ε^2})$ both on the number of rows of any projection matrix and on the sparsity of projection matrices generated by natural constructions.
Using this, we achieve an $\tilde{O}(\frac{1}ε)$ update time per non-zero element for a $(1\pmε)$-approximate projection, thereby substantially outperforming the $\tilde{O}(\frac{1}{ε^2})$ update time required by prior approaches. A variant of our method offers the same guarantees for sparse vectors, yet its $\tilde{O}(d)$ worst case running time matches the best approach of Ailon and Liberty.
△ Less
Submitted 23 April, 2010;
originally announced April 2010.
-
Faster Least Squares Approximation
Authors:
Petros Drineas,
Michael W. Mahoney,
S. Muthukrishnan,
Tamas Sarlos
Abstract:
Least squares approximation is a technique to find an approximate solution to a system of linear equations that has no exact solution. In a typical setting, one lets $n$ be the number of constraints and $d$ be the number of variables, with $n \gg d$. Then, existing exact methods find a solution vector in $O(nd^2)$ time. We present two randomized algorithms that provide very accurate relative-error…
▽ More
Least squares approximation is a technique to find an approximate solution to a system of linear equations that has no exact solution. In a typical setting, one lets $n$ be the number of constraints and $d$ be the number of variables, with $n \gg d$. Then, existing exact methods find a solution vector in $O(nd^2)$ time. We present two randomized algorithms that provide very accurate relative-error approximations to the optimal value and the solution vector of a least squares approximation problem more rapidly than existing exact algorithms. Both of our algorithms preprocess the data with the Randomized Hadamard Transform. One then uniformly randomly samples constraints and solves the smaller problem on those constraints, and the other performs a sparse random projection and solves the smaller problem on those projected coordinates. In both cases, solving the smaller problem provides relative-error approximations, and, if $n$ is sufficiently larger than $d$, the approximate solution can be computed in $O(nd \log d)$ time.
△ Less
Submitted 26 September, 2010; v1 submitted 7 October, 2007;
originally announced October 2007.