-
DNF Learning via Locally Mixing Random Walks
Authors:
Josh Alman,
Shivam Nadimpalli,
Shyamal Patel,
Rocco A. Servedio
Abstract:
We give two results on PAC learning DNF formulas using membership queries in the challenging "distribution-free" learning framework, where learning algorithms must succeed for an arbitrary and unknown distribution over $\{0,1\}^n$.
(1) We first give a quasi-polynomial time "list-decoding" algorithm for learning a single term of an unknown DNF formula. More precisely, for any target $s$-term DNF formula $f = T_1 \vee \cdots \vee T_s$ over $\{0,1\}^n$ and any unknown distribution $D$ over $\{0,1\}^n$, our algorithm, which uses membership queries and random examples from $D$, runs in $\textsf{quasipoly}(n,s)$ time and outputs a list $L$ of candidate terms such that with high probability some term $T_i$ of $f$ belongs to $L$.
(2) We then use result (1) to give a $\textsf{quasipoly}(n,s)$-time algorithm, in the distribution-free PAC learning model with membership queries, for learning the class of size-$s$ DNFs in which all terms have the same size. Our algorithm learns using a DNF hypothesis.
The key tool used to establish result (1) is a new result on "locally mixing random walks," which, roughly speaking, shows that a random walk on a graph that is covered by a small number of expanders has a non-negligible probability of mixing quickly in a subset of these expanders.
Submitted 24 May, 2025;
originally announced May 2025.
-
Only Large Weights (And Not Skip Connections) Can Prevent the Perils of Rank Collapse
Authors:
Josh Alman,
Zhao Song
Abstract:
Attention mechanisms lie at the heart of modern large language models (LLMs). Straightforward algorithms for forward and backward (gradient) computation take quadratic time, and a line of work initiated by [Alman and Song NeurIPS 2023] and [Alman and Song NeurIPS 2024] has shown that quadratic time is necessary unless the model weights are small, in which case almost linear time algorithms are possible. In this paper, we show that large weights are necessary to avoid a strong preclusion to representational strength we call layer collapse, which means that the entire network can be approximated well by a network with only a single layer. Thus, the quadratic running time of attention is unavoidable for expressive transformers.
The notion of layer collapse that we introduce is a variant on the notion of rank collapse from the work of [Dong, Cordonnier, and Loukas ICML 2021]. They showed that in Self Attention Networks with small weights and with skip connections, rank collapse must occur. This is typically interpreted as justifying the necessity of skip connections in expressive networks. However, our result shows that even with skip connections, if the weights are small, then layer collapse still occurs. Thus, only large weights, and not skip connections, can prevent these representational weaknesses.
Submitted 22 May, 2025;
originally announced May 2025.
-
Fast RoPE Attention: Combining the Polynomial Method and Fast Fourier Transform
Authors:
Josh Alman,
Zhao Song
Abstract:
The transformer architecture has been widely applied to many machine learning tasks. A main bottleneck in the time to perform transformer computations is a task called attention computation. [Alman and Song, NeurIPS 2023] have shown that in the bounded entry regime, there is an almost linear time algorithm to approximate the attention computation. They also proved that the bounded entry assumption is necessary for a fast algorithm assuming the popular Strong Exponential Time Hypothesis.
A new version of transformer which uses position embeddings has recently been very successful. At a high level, position embedding enables the model to capture the correlations between tokens while taking into account their position in the sequence. Perhaps the most popular and effective version is Rotary Position Embedding (RoPE), which was proposed by [Su, Lu, Pan, Murtadha, Wen, and Liu, Neurocomputing 2024].
A main downside of RoPE is that it complicates the attention computation problem, so that previous techniques for designing almost linear time algorithms no longer seem to work. In this paper, we show how to overcome this issue, and give a new algorithm to compute the RoPE attention in almost linear time in the bounded entry regime. (Again, known lower bounds imply that bounded entries are necessary.) Our new algorithm combines two techniques in a novel way: the polynomial method, which was used in prior fast attention algorithms, and the Fast Fourier Transform.
Submitted 17 May, 2025;
originally announced May 2025.
-
Low Rank Matrix Rigidity: Tight Lower Bounds and Hardness Amplification
Authors:
Josh Alman,
Jingxun Liang
Abstract:
For an $N \times N$ matrix $A$, its rank-$r$ rigidity, denoted $\mathcal{R}_A(r)$, is the minimum number of entries of $A$ that one must change to make its rank become at most $r$. Determining the rigidity of interesting explicit families of matrices remains a major open problem, and is central to understanding the complexities of these matrices in many different models of computation and communication. We focus in this paper on the Walsh-Hadamard transform and on the `distance matrix', whose rows and columns correspond to binary vectors, and whose entries calculate whether the row and column are close in Hamming distance. Our results also generalize to other Kronecker powers and `Majority powers' of fixed matrices. We prove two new results about such matrices.
First, we prove new rigidity lower bounds in the low-rank regime where $r < \log N$. For instance, we prove that over any finite field, there are constants $c_1, c_2 > 0$ such that the $N \times N$ Walsh-Hadamard matrix $H_n$ satisfies $$\mathcal{R}_{H_n}(c_1 \log N) \geq N^2 \left( \frac12 - N^{-c_2} \right),$$ and a similar lower bound for the other aforementioned matrices. This is tight, and is the new best rigidity lower bound for an explicit matrix family at this rank; the previous best was $\mathcal{R}(c_1 \log N) \geq c_3 N^2$ for a small constant $c_3>0$.
Second, we give new hardness amplification results, showing that rigidity lower bounds for these matrices for slightly higher rank would imply breakthrough rigidity lower bounds for much higher rank. For instance, if one could prove $$\mathcal{R}_{H_n}(\log^{1 + \varepsilon} N) \geq N^2 \left( \frac12 - N^{-1/2^{(\log \log N)^{o(1)}}} \right)$$ over any finite field for some $\varepsilon>0$, this would imply that $H_n$ is Razborov rigid, giving a breakthrough lower bound in communication complexity.
Submitted 26 February, 2025;
originally announced February 2025.
-
Faster Algorithms for Average-Case Orthogonal Vectors and Closest Pair Problems
Authors:
Josh Alman,
Alexandr Andoni,
Hengjie Zhang
Abstract:
We study the average-case version of the Orthogonal Vectors problem, in which one is given as input $n$ vectors from $\{0,1\}^d$ which are chosen randomly so that each coordinate is $1$ independently with probability $p$. Kane and Williams [ITCS 2019] showed how to solve this problem in time $O(n^{2 - δ_p})$ for a constant $δ_p > 0$ that depends only on $p$. However, it was previously unclear how to solve the problem faster in the hardest parameter regime where $p$ may depend on $d$.
The best prior algorithm was the best worst-case algorithm by Abboud, Williams and Yu [SODA 2014], which in dimension $d = c \cdot \log n$, solves the problem in time $n^{2 - Ω(1/\log c)}$. In this paper, we give a new algorithm which improves this to $n^{2 - Ω(\log\log c /\log c)}$ in the average case for any parameter $p$.
As in the prior work, our algorithm uses the polynomial method. We make use of a very simple polynomial over the reals, and use a new method to analyze its performance based on computing how its value degrades as the input vectors get farther from orthogonal.
To demonstrate the generality of our approach, we also solve the average-case version of the closest pair problem in the same running time.
Submitted 29 October, 2024;
originally announced October 2024.
-
Improving the Leading Constant of Matrix Multiplication
Authors:
Josh Alman,
Hantao Yu
Abstract:
Algebraic matrix multiplication algorithms are designed by bounding the rank of matrix multiplication tensors, and then using a recursive method. However, designing algorithms in this way quickly leads to large constant factors: if one proves that the tensor for multiplying $n \times n$ matrices has rank $\leq t$, then the resulting recurrence shows that $M \times M$ matrices can be multiplied using $O(n^2 \cdot M^{\log_n t})$ operations, where the leading constant scales proportionally to $n^2$. Even modest increases in $n$ can blow up the leading constant too much to be worth the slight decrease in the exponent of $M$. Meanwhile, the asymptotically best algorithms use very large $n$, such that $n^2$ is larger than the number of atoms in the visible universe!
In this paper, we give new ways to use tensor rank bounds to design matrix multiplication algorithms, which lead to smaller leading constants than the standard recursive method. Our main result shows that, if the tensor for multiplying $n \times n$ matrices has rank $\leq t$, then $M \times M$ matrices can be multiplied using only $n^{O(1/(\log n)^{0.33})} \cdot M^{\log_n t}$ operations. In other words, we improve the leading constant in general from $O(n^2)$ to $n^{O(1/(\log n)^{0.33})} < n^{o(1)}$. We then apply this and further improve the leading constant in a number of situations of interest. We show that, in the popularly-conjectured case where $ω=2$, a new, different recursive approach can lead to an improvement. We also show that the leading constant of the current asymptotically fastest matrix multiplication algorithm, and any algorithm designed using the group-theoretic method, can be further improved by taking advantage of additional structure of the underlying tensor identities.
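To make the "standard recursive method" concrete, here is a minimal sketch that counts arithmetic operations for the smallest familiar instance, $n = 2$ and $t = 7$. It assumes Strassen's identity with its usual 18 block additions and subtractions per level; that identity and the function name are illustrative choices, not taken from the abstract.

```python
def strassen_op_count(m):
    """Count scalar operations when the n=2, t=7 recursion multiplies
    two m x m matrices, where m is a power of 2.

    Assumption (not from the abstract): each level performs 7 recursive
    block products plus 18 additions/subtractions of (m/2) x (m/2) blocks.
    """
    if m < 1 or m & (m - 1):
        raise ValueError("m must be a power of 2")
    if m == 1:
        return 1  # a single scalar multiplication
    half = m // 2
    return 7 * strassen_op_count(half) + 18 * half * half


if __name__ == "__main__":
    for k in range(1, 6):
        m = 2 ** k
        print(m, strassen_op_count(m))
```

Under these assumptions the recurrence solves to $7 \cdot M^{\log_2 7} - 6 M^2$, so the leading constant is already 7 for $n = 2$; this is the kind of constant the abstract says grows with $n^2$ under the standard recursion.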
Submitted 27 October, 2024;
originally announced October 2024.
-
Fundamental Limitations on Subquadratic Alternatives to Transformers
Authors:
Josh Alman,
Hantao Yu
Abstract:
The Transformer architecture is widely deployed in many popular and impactful Large Language Models. At its core is the attention mechanism for calculating correlations between pairs of tokens. Performing an attention computation takes quadratic time in the input size, and has become the time bottleneck for transformer operations. In order to circumvent this, researchers have used a variety of approaches, including designing heuristic algorithms for performing attention computations faster, and proposing alternatives to the attention mechanism which can be computed more quickly. For instance, state space models such as Mamba were designed to replace attention with an almost linear time alternative.
In this paper, we prove that any such approach cannot perform important tasks that Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm. Thus, any model which can be evaluated in subquadratic time - whether because of subquadratic-time heuristics for attention, faster attention replacements like Mamba, or any other reason - cannot perform this task. In other words, in order to perform tasks that (implicitly or explicitly) involve document similarity, one may as well use Transformer and cannot avoid its quadratic running time.
Submitted 22 May, 2025; v1 submitted 5 October, 2024;
originally announced October 2024.
-
Finer-Grained Hardness of Kernel Density Estimation
Authors:
Josh Alman,
Yunfeng Guan
Abstract:
In batch Kernel Density Estimation (KDE) for a kernel function $f$, we are given as input $2n$ points $x^{(1)}, \cdots, x^{(n)}, y^{(1)}, \cdots, y^{(n)}$ in dimension $m$, as well as a vector $w \in \mathbb{R}^n$. These inputs implicitly define the $n \times n$ kernel matrix $K$ given by $K[i,j] = f(x^{(i)}, y^{(j)})$. The goal is to compute a vector $v$ which approximates $K w$ with $|| Kw - v||_\infty < \varepsilon ||w||_1$. A recent line of work has proved fine-grained lower bounds conditioned on SETH. Backurs et al. first showed the hardness of KDE for Gaussian-like kernels with high dimension $m = Ω(\log n)$ and large scale $B = Ω(\log n)$. Alman et al. later developed new reductions in roughly this same parameter regime, leading to lower bounds for more general kernels, but only for very small error $\varepsilon < 2^{- \log^{Ω(1)} (n)}$.
In this paper, we refine the approach of Alman et al. to show new lower bounds in all parameter regimes, closing gaps between the known algorithms and lower bounds. In the setting where $m = C\log n$ and $B = o(\log n)$, we prove Gaussian KDE requires $n^{2-o(1)}$ time to achieve additive error $\varepsilon < Ω(m/B)^{-m}$, matching the performance of the polynomial method up to low-order terms. In the low dimensional setting $m = o(\log n)$, we show that Gaussian KDE requires $n^{2-o(1)}$ time to achieve $\varepsilon$ such that $\log \log (\varepsilon^{-1}) > \tilde Ω((\log n)/m)$, matching the error bound achievable by FMM up to low-order terms. To our knowledge, no nontrivial lower bound was previously known in this regime.
Our new lower bounds make use of an intricate analysis of a special case of the kernel matrix -- the `counting matrix'. As a key technical lemma, we give a novel approach to bounding the entries of its inverse by using Schur polynomials from algebraic combinatorics.
Submitted 2 July, 2024;
originally announced July 2024.
-
More Asymmetry Yields Faster Matrix Multiplication
Authors:
Josh Alman,
Ran Duan,
Virginia Vassilevska Williams,
Yinzhan Xu,
Zixuan Xu,
Renfei Zhou
Abstract:
We present a new improvement on the laser method for designing fast matrix multiplication algorithms. The new method further develops the recent advances by [Duan, Wu, Zhou FOCS 2023] and [Vassilevska Williams, Xu, Xu, Zhou SODA 2024]. Surprisingly, the new improvement is achieved by incorporating more asymmetry in the analysis, circumventing a fundamental tool of prior work that requires two of the three dimensions to be treated identically. The method yields a new bound on the square matrix multiplication exponent $$ω<2.371339,$$ improved from the previous bound of $ω<2.371552$. We also improve the bounds of the exponents for multiplying rectangular matrices of various shapes.
Submitted 20 October, 2024; v1 submitted 25 April, 2024;
originally announced April 2024.
-
The Fine-Grained Complexity of Gradient Computation for Training Large Language Models
Authors:
Josh Alman,
Zhao Song
Abstract:
Large language models (LLMs) have made fundamental contributions over the last few years. To train an LLM, one needs to alternately run `forward' computations and `backward' computations. The forward computation can be viewed as attention function evaluation, and the backward computation can be viewed as a gradient computation. In previous work by [Alman and Song, NeurIPS 2023], it was proved that the forward step can be performed in almost-linear time in certain parameter regimes, but that there is no truly sub-quadratic time algorithm in the remaining parameter regimes unless the popular hypothesis SETH is false. In this work, we show nearly identical results for the harder-seeming problem of computing the gradient of the loss function of a one-layer attention network, and thus for the entire process of LLM training. This completely characterizes the fine-grained complexity of every step of LLM training.
Submitted 6 February, 2024;
originally announced February 2024.
-
Generalizations of Matrix Multiplication can solve the Light Bulb Problem
Authors:
Josh Alman,
Hengjie Zhang
Abstract:
In the light bulb problem, one is given uniformly random vectors $x_1, \ldots, x_n, y_1, \ldots, y_n \in \{-1,1\}^d$. They are all chosen independently, except that a planted pair $(x_{i^*}, y_{j^*})$ is chosen with correlation $ρ>0$. The goal is to find the planted pair. This problem was introduced over 30 years ago by L. Valiant, and is known to have many applications in data analysis, statistics, and learning theory.
The naive algorithm runs in $Ω(n^2)$ time, and algorithms based on Locality-Sensitive Hashing approach quadratic time as $ρ\to 0$. In 2012, G. Valiant gave a breakthrough algorithm using fast matrix multiplication that runs in time $O(n^{(5-ω)/(4-ω)}) < O(n^{1.615})$, no matter how small $ρ>0$ is. This was subsequently refined by Karppa, Kaski, and Kohonen in 2016 to $O(n^{2 ω/ 3}) < O(n^{1.582})$.
In this paper, we propose a new approach which can replace the matrix multiplication tensor with other tensors. Those tensors can omit some terms one is supposed to compute, and include additional error terms. Our new approach can make use of any tensors which previously had no known algorithmic applications, including tensors which arise naturally as intermediate steps in border rank methods and in the Laser method.
We further show that our approach can be combined with locality-sensitive hashing to design an algorithm whose running time improves as $ρ$ gets larger. To our knowledge, this is the first algorithm which combines fast matrix multiplication with hashing for the light bulb problem or any closest pair problem, and it leads to faster algorithms for small $ρ>0$.
We also introduce a new tensor $T_{2112}$, which has the same size as the $2 \times 2$ matrix multiplication tensor, but yields a faster algorithm for the light bulb problem than Strassen's algorithm does.
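For reference, the naive quadratic baseline mentioned in this abstract can be sketched as follows. This is only an illustration: the function name is hypothetical, and it assumes $d$ is large enough that the planted pair is the one with the largest inner product, which holds only with high probability on random inputs.

```python
def find_planted_pair(xs, ys):
    """Return (i, j) maximizing the inner product <xs[i], ys[j]>.

    Naive baseline: checks all n^2 pairs, so it takes on the order of
    n^2 * d time. Assumes (illustratively) that the planted pair has the
    strictly largest inner product among all pairs.
    """
    best_value, best_pair = None, None
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            value = sum(a * b for a, b in zip(x, y))
            if best_value is None or value > best_value:
                best_value, best_pair = value, (i, j)
    return best_pair


if __name__ == "__main__":
    # Tiny hand-built instance over {-1, 1}^4: xs[1] and ys[1] are identical,
    # every other pair has inner product 0.
    xs = [[1, 1, 1, 1], [1, -1, 1, -1]]
    ys = [[1, 1, -1, -1], [1, -1, 1, -1]]
    print(find_planted_pair(xs, ys))
```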
Submitted 2 November, 2023;
originally announced November 2023.
-
How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation
Authors:
Josh Alman,
Zhao Song
Abstract:
In the classical transformer attention scheme, we are given three $n \times d$ size matrices $Q, K, V$ (the query, key, and value tokens), and the goal is to compute a new $n \times d$ size matrix $D^{-1} \exp(QK^\top) V$ where $D = \mathrm{diag}( \exp(QK^\top) {\bf 1}_n )$. In this work, we study a generalization of attention which captures triple-wise correlations. This generalization is able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers. The potential downside of this generalization is that it appears as though computations are even more difficult, since the straightforward algorithm requires cubic time in $n$. However, we show that in the bounded-entry setting (which arises in practice, and which is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that bounded entries are both necessary and sufficient for quickly performing generalized computations:
$\bullet$ On the positive side, if all entries of the input matrices are bounded above by $o(\sqrt[3]{\log n})$ then we show how to approximate the ``tensor-type'' attention matrix in $n^{1+o(1)}$ time.
$\bullet$ On the negative side, we show that if the entries of the input matrices may be as large as $Ω(\sqrt[3]{\log n})$, then there is no algorithm that runs faster than $n^{3-o(1)}$ (assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory).
We also show that our construction, algorithms, and lower bounds naturally generalize to higher-order tensors and correlations. Interestingly, the higher the order of the tensors, the lower the bound on the entries needs to be for an efficient algorithm. Our results thus yield a natural tradeoff between the boundedness of the entries, and order of the tensor one may use for more expressive, efficient attention computation.
Submitted 6 October, 2023;
originally announced October 2023.
-
Tensor Ranks and the Fine-Grained Complexity of Dynamic Programming
Authors:
Josh Alman,
Ethan Turok,
Hantao Yu,
Hengzhi Zhang
Abstract:
Generalizing work of Künnemann, Paturi, and Schneider [ICALP 2017], we study a wide class of high-dimensional dynamic programming (DP) problems in which one must find the shortest path between two points in a high-dimensional grid given a tensor of transition costs between nodes in the grid. This captures many classical problems which are solved using DP such as the knapsack problem, the airplane refueling problem, and the minimal-weight polygon triangulation problem. We observe that for many of these problems, the tensor naturally has low tensor rank or low slice rank.
We then give new algorithms and a web of fine-grained reductions to tightly determine the complexity of these problems. For instance, we show that a polynomial speedup over the DP algorithm is possible when the tensor rank is a constant or the slice rank is 1, but that such a speedup is impossible if the tensor rank is slightly super-constant (assuming SETH) or the slice rank is at least 3 (assuming the APSP conjecture). We find that this characterizes the known complexities for many of these problems, and in some cases leads to new faster algorithms.
Submitted 2 January, 2024; v1 submitted 9 September, 2023;
originally announced September 2023.
-
Fast Attention Requires Bounded Entries
Authors:
Josh Alman,
Zhao Song
Abstract:
In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices $Q, K, V \in [-B,B]^{n \times d}$, and the goal is to construct the matrix $\mathrm{Att}(Q,K,V) := \mathrm{diag}(A {\bf 1}_n)^{-1} A V \in \mathbb{R}^{n \times d}$, where $A = \exp(QK^\top/d)$ is the `attention matrix', and $\exp$ is applied entry-wise. Straightforward methods for this problem explicitly compute the $n \times n$ attention matrix $A$, and hence require time $Ω(n^2)$ even when $d = n^{o(1)}$ is small.
In this paper, we investigate whether faster algorithms are possible by implicitly making use of the matrix $A$. We present two results, showing that there is a sharp transition at $B = Θ(\sqrt{\log n})$.
$\bullet$ If $d = O(\log n)$ and $B = o(\sqrt{\log n})$, there is an $n^{1+o(1)}$ time algorithm to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error.
$\bullet$ If $d = O(\log n)$ and $B = Θ(\sqrt{\log n})$, assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error in truly subquadratic time $n^{2 - Ω(1)}$.
This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
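The "straightforward method" described in this abstract can be sketched directly from the definition $\mathrm{Att}(Q,K,V) = \mathrm{diag}(A {\bf 1}_n)^{-1} A V$ with $A = \exp(QK^\top/d)$. The code below is a plain-Python illustration with a hypothetical function name; it forms every row of the $n \times n$ matrix $A$, so it is the quadratic-time baseline and not the paper's almost linear time algorithm.

```python
import math


def naive_attention(Q, K, V):
    """Compute diag(A 1_n)^{-1} A V, where A = exp(Q K^T / d) entry-wise.

    Q, K, V are n x d lists of lists. Each row of A is built explicitly,
    so the running time is at least quadratic in n.
    """
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        # Row i of the attention matrix A.
        row = [
            math.exp(sum(Q[i][k] * K[j][k] for k in range(d)) / d)
            for j in range(n)
        ]
        norm = sum(row)  # i-th diagonal entry of diag(A 1_n)
        out.append(
            [sum(row[j] * V[j][c] for j in range(n)) / norm
             for c in range(len(V[0]))]
        )
    return out


if __name__ == "__main__":
    # With Q = 0 every entry of A is 1, so each output row is the
    # column-wise average of V.
    Q = [[0.0, 0.0], [0.0, 0.0]]
    K = [[1.0, 2.0], [3.0, 4.0]]
    V = [[1.0, 0.0], [3.0, 2.0]]
    print(naive_attention(Q, K, V))
```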
Submitted 9 May, 2023; v1 submitted 25 February, 2023;
originally announced February 2023.
-
Matrix Multiplication and Number On the Forehead Communication
Authors:
Josh Alman,
Jarosław Błasiok
Abstract:
Three-player Number On the Forehead communication may be thought of as a three-player Number In the Hand promise model, in which each player is given the inputs that are supposedly on the other two players' heads, and promised that they are consistent with the inputs of the other players. The set of all allowed inputs under this promise may be thought of as an order-3 tensor. We surprisingly observe that this tensor is exactly the matrix multiplication tensor, which is widely studied in the design of fast matrix multiplication algorithms.
Using this connection, we prove a number of results about both Number On the Forehead communication and matrix multiplication, each by using known results or techniques about the other. For example, we show how the Laser method, a key technique used to design the best matrix multiplication algorithms, can also be used to design communication protocols for a variety of problems. We also show how known lower bounds for Number On the Forehead communication can be used to bound properties of the matrix multiplication tensor such as its zeroing out subrank. Finally, we substantially generalize known methods based on slice-rank for studying communication, and show how they directly relate to the matrix multiplication exponent $ω$.
Submitted 22 February, 2023;
originally announced February 2023.
-
Bypass Exponential Time Preprocessing: Fast Neural Network Training via Weight-Data Correlation Preprocessing
Authors:
Josh Alman,
Jiehao Liang,
Zhao Song,
Ruizhe Zhang,
Danyang Zhuo
Abstract:
Over the last decade, deep neural networks have transformed our society, and they are already widely applied in various machine learning applications. State-of-the-art deep neural networks are becoming larger in size every year to deliver increasing model accuracy, and as a result, model training consumes substantial computing resources and will only consume more in the future. Using current training methods, in each iteration, to process a data point $x \in \mathbb{R}^d$ in a layer, we need to spend $Θ(md)$ time to evaluate all the $m$ neurons in the layer. This means processing the entire layer takes $Θ(nmd)$ time for $n$ data points. Recent work [Song, Yang and Zhang, NeurIPS 2021] reduces this time per iteration to $o(nmd)$, but requires exponential time to preprocess either the data or the neural network weights, making it unlikely to have practical usage.
In this work, we present a new preprocessing method that simply stores the weight-data correlation in a tree data structure in order to quickly, dynamically detect which neurons fire at each iteration. Our method requires only $O(nmd)$ time in preprocessing and still achieves $o(nmd)$ time per iteration. We complement our new algorithm with a lower bound, proving that assuming a popular conjecture from complexity theory, one could not substantially speed up our algorithm for dynamic detection of firing neurons.
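To make the $Θ(nmd)$ baseline from this abstract concrete, here is a minimal sketch of the naive per-iteration pass that the paper's preprocessing is meant to avoid. The function name, the list-of-lists layout, and the firing rule (a neuron fires when its inner product with the data point exceeds a fixed `threshold`) are assumptions made for illustration; they are not taken from the paper.

```python
def firing_neurons_naive(weights, data, threshold):
    """Return, for each data point, the indices of neurons that fire.

    weights:   m weight vectors, each of length d (hypothetical layout)
    data:      n data points, each of length d
    threshold: assumed activation threshold; a neuron "fires" when
               <w, x> > threshold (an illustrative rule, not the paper's)
    """
    fired = []
    for x in data:                          # n data points
        active = []
        for r, w in enumerate(weights):     # m neurons per point
            # d multiplications per neuron, so m * d work per data point
            s = sum(wi * xi for wi, xi in zip(w, x))
            if s > threshold:
                active.append(r)
        fired.append(active)
    return fired
```

The three nested loops (the innermost hidden in `sum`) account for the $n \cdot m \cdot d$ cost; the abstract's claim is that a tree over weight-data correlations can find the firing set without touching every neuron.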
Submitted 25 November, 2022;
originally announced November 2022.
-
Faster Walsh-Hadamard and Discrete Fourier Transforms From Matrix Non-Rigidity
Authors:
Josh Alman,
Kevin Rao
Abstract:
We give algorithms with lower arithmetic operation counts for both the Walsh-Hadamard Transform (WHT) and the Discrete Fourier Transform (DFT) on inputs of power-of-2 size $N$.
For the WHT, our new algorithm has an operation count of $\frac{23}{24}N \log N + O(N)$. To our knowledge, this gives the first improvement on the $N \log N$ operation count of the simple, folklore Fast Walsh-Hadamard Transform algorithm.
For the DFT, our new FFT algorithm uses $\frac{15}{4}N \log N + O(N)$ real arithmetic operations. Our leading constant $\frac{15}{4} = 3.75$ improves on the leading constant of $5$ from the Cooley-Tukey algorithm from 1965, leading constant $4$ from the split-radix algorithm of Yavne from 1968, leading constant $\frac{34}{9}=3.777\ldots$ from a modification of the split-radix algorithm by Van Buskirk from 2004, and leading constant $3.76875$ from a theoretically optimized version of Van Buskirk's algorithm by Sergeev from 2017.
Our new WHT algorithm takes advantage of a recent line of work on the non-rigidity of the WHT: we decompose the WHT matrix as the sum of a low-rank matrix and a sparse matrix, and then analyze the structures of these matrices to achieve a lower operation count. Our new DFT algorithm comes from a novel reduction, showing that parts of the previous best FFT algorithms can be replaced by calls to an algorithm for the WHT. Replacing the folklore WHT algorithm with our new improved algorithm leads to our improved FFT.
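For reference, the folklore Fast Walsh-Hadamard Transform whose $N \log N$ operation count this abstract improves on can be sketched as follows. This is the standard in-place butterfly, not the paper's new algorithm; the function names and the operation counter are illustrative additions.

```python
def fwht(vec):
    """Folklore fast Walsh-Hadamard transform (Sylvester ordering).

    Returns (transformed list, number of additions/subtractions used).
    """
    a = list(vec)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    ops = 0
    h = 1
    while h < n:                            # log2(n) stages
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = a[i], a[i + h]
                a[i], a[i + h] = x + y, x - y   # one butterfly: 2 operations
                ops += 2
        h *= 2
    return a, ops


def wht_naive(vec):
    """Direct O(N^2) evaluation with H[i][j] = (-1)^popcount(i & j)."""
    n = len(vec)
    return [
        sum(vec[j] * (-1) ** bin(i & j).count("1") for j in range(n))
        for i in range(n)
    ]
```

Each of the $\log_2 N$ stages performs $N/2$ butterflies of two operations each, which gives the $N \log_2 N$ count quoted in the abstract.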
Submitted 14 June, 2023; v1 submitted 11 November, 2022;
originally announced November 2022.
-
Smaller Low-Depth Circuits for Kronecker Powers
Authors:
Josh Alman,
Yunfeng Guan,
Ashwin Padaki
Abstract:
We give new, smaller constructions of constant-depth linear circuits for computing any matrix which is the Kronecker power of a fixed matrix. A standard argument (e.g., the mixed product property of Kronecker products, or a generalization of the Fast Walsh-Hadamard transform) shows that any such $N \times N$ matrix has a depth-2 circuit of size $O(N^{1.5})$. We improve on this for all such matrices, and especially for some such matrices of particular interest:
- For any integer $q > 1$ and any matrix which is the Kronecker power of a fixed $q \times q$ matrix, we construct a depth-2 circuit of size $O(N^{1.5 - a_q})$, where $a_q > 0$ is a positive constant depending only on $q$. No bound beating size $O(N^{1.5})$ was previously known for any $q>2$.
- For the case $q=2$, i.e., for any matrix which is the Kronecker power of a fixed $2 \times 2$ matrix, we construct a depth-2 circuit of size $O(N^{1.446})$, improving the prior best size $O(N^{1.493})$ [Alman, 2021].
- For the Walsh-Hadamard transform, we construct a depth-2 circuit of size $O(N^{1.443})$, improving the prior best size $O(N^{1.476})$ [Alman, 2021].
- For the disjointness matrix (the communication matrix of set disjointness, or equivalently, the matrix for the linear transform that evaluates a multilinear polynomial on all $0/1$ inputs), we construct a depth-2 circuit of size $O(N^{1.258})$, improving the prior best size $O(N^{1.272})$ [Jukna and Sergeev, 2013].
Our constructions also generalize to improving the standard construction for any depth $\leq O(\log N)$. Our main technical tool is an improved way to convert a nontrivial circuit for any matrix into a circuit for its Kronecker powers. Our new bounds provably could not be achieved using the approaches of prior work.
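The "standard argument" for the $O(N^{1.5})$ depth-2 baseline mentioned in this abstract can be written out as below. This is a sketch under two assumptions not stated in the abstract: $n$ is even, and the size of a depth-2 linear circuit is counted as the number of nonzero entries in its two layers.

```latex
% Setting (assumed): M is a fixed q x q matrix, N = q^n with n even,
% so M^{\otimes n/2} is a \sqrt{N} x \sqrt{N} matrix.
\[
  M^{\otimes n}
  = \bigl(M^{\otimes n/2} \otimes I_{\sqrt{N}}\bigr)
    \bigl(I_{\sqrt{N}} \otimes M^{\otimes n/2}\bigr),
\]
% which follows from the mixed-product property
% (A \otimes B)(C \otimes D) = (AC) \otimes (BD).
\[
  \operatorname{nnz}\bigl(M^{\otimes n/2} \otimes I_{\sqrt{N}}\bigr)
  = \operatorname{nnz}\bigl(M^{\otimes n/2}\bigr) \cdot \sqrt{N}
  \le (\sqrt{N})^{2} \cdot \sqrt{N}
  = N^{1.5},
\]
% and the same bound holds for the second factor, so the two layers
% together have at most 2 N^{1.5} = O(N^{1.5}) nonzero entries.
```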
Submitted 9 November, 2022;
originally announced November 2022.
-
Faster Walsh-Hadamard Transform and Matrix Multiplication over Finite Fields using Lookup Tables
Authors:
Josh Alman
Abstract:
We use lookup tables to design faster algorithms for important algebraic problems over finite fields. These faster algorithms, which only use arithmetic operations and lookup table operations, may help to explain the difficulty of determining the complexities of these important problems. Our results over a constant-sized finite field are as follows.
The Walsh-Hadamard transform of a vector of length $N$ can be computed using $O(N \log N / \log \log N)$ bit operations. This generalizes to any transform defined as a Kronecker power of a fixed matrix. By comparison, the Fast Walsh-Hadamard transform (similar to the Fast Fourier transform) uses $O(N \log N)$ arithmetic operations, which is believed to be optimal up to constant factors.
Any algebraic algorithm for multiplying two $N \times N$ matrices using $O(N^ω)$ operations can be converted into an algorithm using $O(N^ω/ (\log N)^{ω/2 - 1})$ bit operations. For example, Strassen's algorithm can be converted into an algorithm using $O(N^{2.81} / (\log N)^{0.4})$ bit operations. It remains an open problem with practical implications to determine the smallest constant $c$ such that Strassen's algorithm can be implemented to use $c \cdot N^{2.81} + o(N^{2.81})$ arithmetic operations; using a lookup table allows one to save a super-constant factor in bit operations.
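As a quick check of the Strassen example in this abstract, the exponent of the logarithmic saving follows from plugging $ω = \log_2 7$ into the general bound; the numerical values below are rounded.

```latex
\[
  ω_{\text{Strassen}} = \log_2 7 \approx 2.8074,
  \qquad
  \frac{ω_{\text{Strassen}}}{2} - 1 \approx 1.4037 - 1 = 0.4037,
\]
\[
  O\!\left(\frac{N^{ω}}{(\log N)^{ω/2 - 1}}\right)
  \;\longrightarrow\;
  O\!\left(\frac{N^{2.81}}{(\log N)^{0.4}}\right).
\]
```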
Submitted 8 November, 2022;
originally announced November 2022.
-
Optimal-Degree Polynomial Approximations for Exponentials and Gaussian Kernel Density Estimation
Authors:
Amol Aggarwal,
Josh Alman
Abstract:
For any real numbers $B \ge 1$ and $δ\in (0, 1)$ and function $f: [0, B] \rightarrow \mathbb{R}$, let $d_{B; δ} (f) \in \mathbb{Z}_{> 0}$ denote the minimum degree of a polynomial $p(x)$ satisfying $\sup_{x \in [0, B]} \big| p(x) - f(x) \big| < δ$. In this paper, we provide precise asymptotics for $d_{B; δ} (e^{-x})$ and $d_{B; δ} (e^{x})$ in terms of both $B$ and $δ$, improving both the previously known upper bounds and lower bounds. In particular, we show $$d_{B; δ} (e^{-x}) = Θ\left( \max \left\{ \sqrt{B \log(δ^{-1})}, \frac{\log(δ^{-1}) }{ \log(B^{-1} \log(δ^{-1}))} \right\}\right), \text{ and}$$ $$d_{B; δ} (e^{x}) = Θ\left( \max \left\{ B, \frac{\log(δ^{-1}) }{ \log(B^{-1} \log(δ^{-1}))} \right\}\right).$$
Polynomial approximations for $e^{-x}$ and $e^x$ have applications to the design of algorithms for many problems, and our degree bounds show both the power and limitations of these algorithms.
We focus in particular on the Batch Gaussian Kernel Density Estimation problem for $n$ sample points in $Θ(\log n)$ dimensions with error $δ= n^{-Θ(1)}$. We show that the running time one can achieve depends on the square of the diameter of the point set, $B$, with a transition at $B = Θ(\log n)$ mirroring the corresponding transition in $d_{B; δ} (e^{-x})$:
- When $B=o(\log n)$, we give the first algorithm running in time $n^{1 + o(1)}$.
- When $B = κ\log n$ for a small constant $κ>0$, we give an algorithm running in time $n^{1 + O(\log \log κ^{-1} /\log κ^{-1})}$. The $\log \log κ^{-1} /\log κ^{-1}$ term in the exponent comes from analyzing the behavior of the leading constant in our computation of $d_{B; δ} (e^{-x})$.
- When $B = ω(\log n)$, we show that time $n^{2 - o(1)}$ is necessary assuming SETH.
Submitted 12 May, 2022;
originally announced May 2022.
-
Parameterized Sensitivity Oracles and Dynamic Algorithms using Exterior Algebras
Authors:
Josh Alman,
Dean Hirsch
Abstract:
We design the first efficient sensitivity oracles and dynamic algorithms for a variety of parameterized problems. Our main approach is to modify the algebraic coding technique from static parameterized algorithm design, which had not previously been used in a dynamic context. We particularly build off of the `extensor coding' method of Brand, Dell and Husfeldt [STOC'18], employing properties of the exterior algebra over different fields.
For the $k$-Path detection problem for directed graphs, it is known that no efficient dynamic algorithm exists (under popular assumptions from fine-grained complexity). We circumvent this by designing an efficient sensitivity oracle, which preprocesses a directed graph on $n$ vertices in $2^k poly(k) n^{ω+o(1)}$ time, such that, given $\ell$ updates (mixing edge insertions and deletions, and vertex deletions) to that input graph, it can decide in time $\ell^2 2^kpoly(k)$ and with high probability, whether the updated graph contains a path of length $k$. We also give a deterministic sensitivity oracle requiring $4^k poly(k) n^{ω+o(1)}$ preprocessing time and $\ell^2 2^{ωk + o(k)}$ query time, and obtain a randomized sensitivity oracle for the task of approximately counting the number of $k$-paths. For $k$-Path detection in undirected graphs, we obtain a randomized sensitivity oracle with $O(1.66^k n^3)$ preprocessing time and $O(\ell^3 1.66^k)$ query time, and a better bound for undirected bipartite graphs.
In addition, we present the first fully dynamic algorithms for a variety of problems: $k$-Partial Cover, $m$-Set $k$-Packing, $t$-Dominating Set, $d$-Dimensional $k$-Matching, and Exact $k$-Partial Cover. For example, for $k$-Partial Cover we show a randomized dynamic algorithm with $2^k poly(k)polylog(n)$ update time, and a deterministic dynamic algorithm with $4^kpoly(k)polylog(n)$ update time.
Submitted 18 June, 2022; v1 submitted 22 April, 2022;
originally announced April 2022.
-
Kronecker Products, Low-Depth Circuits, and Matrix Rigidity
Authors:
Josh Alman
Abstract:
For a matrix $M$ and a positive integer $r$, the rank $r$ rigidity of $M$ is the smallest number of entries of $M$ which one must change to make its rank at most $r$. There are many known applications of rigidity lower bounds to a variety of areas in complexity theory, but fewer known applications of rigidity upper bounds. In this paper, we use rigidity upper bounds to prove new upper bounds in a few different models of computation. Our results include:
$\bullet$ For any $d> 1$, and over any field $\mathbb{F}$, the $N \times N$ Walsh-Hadamard transform has a depth-$d$ linear circuit of size $O(d \cdot N^{1 + 0.96/d})$. This circumvents a known lower bound of $Ω(d \cdot N^{1 + 1/d})$ for circuits with bounded coefficients over $\mathbb{C}$ by Pudlák (2000), by using coefficients of magnitude polynomial in $N$. Our construction also generalizes to linear transformations given by a Kronecker power of any fixed $2 \times 2$ matrix.
$\bullet$ The $N \times N$ Walsh-Hadamard transform has a linear circuit of size $\leq (1.81 + o(1)) N \log_2 N$, improving on the bound of $\approx 1.88 N \log_2 N$ which one obtains from the standard fast Walsh-Hadamard transform.
$\bullet$ A new rigidity upper bound, showing that the following classes of matrices are not rigid enough to prove circuit lower bounds using Valiant's approach:
$-$ for any field $\mathbb{F}$ and any function $f : \{0,1\}^n \to \mathbb{F}$, the matrix $V_f \in \mathbb{F}^{2^n \times 2^n}$ given by, for any $x,y \in \{0,1\}^n$, $V_f[x,y] = f(x \wedge y)$, and
$-$ for any field $\mathbb{F}$ and any fixed-size matrices $M_1, \ldots, M_n \in \mathbb{F}^{q \times q}$, the Kronecker product $M_1 \otimes M_2 \otimes \cdots \otimes M_n$.
This generalizes recent results on non-rigidity, using a simpler approach which avoids needing the polynomial method.
Submitted 23 February, 2021;
originally announced February 2021.
-
Metric Transforms and Low Rank Matrices via Representation Theory of the Real Hyperrectangle
Authors:
Josh Alman,
Timothy Chu,
Gary Miller,
Shyam Narayanan,
Mark Sellke,
Zhao Song
Abstract:
In this paper, we develop a new technique which we call representation theory of the real hyperrectangle, which describes how to compute the eigenvectors and eigenvalues of certain matrices arising from hyperrectangles. We show that these matrices arise naturally when analyzing a number of different algorithmic tasks such as kernel methods, neural network training, natural language processing, and the design of algorithms using the polynomial method. We then use our new technique along with these connections to prove several new structural results in these areas, including:
$\bullet$ A function is a positive definite Manhattan kernel if and only if it is a completely monotone function. These kernels are widely used across machine learning; one example is the Laplace kernel which is widely used in machine learning for chemistry.
$\bullet$ A function transforms Manhattan distances to Manhattan distances if and only if it is a Bernstein function. This completes the theory of Manhattan to Manhattan metric transforms initiated by Assouad in 1980.
$\bullet$ A function applied entry-wise to any square matrix of rank $r$ always results in a matrix of rank $< 2^{r-1}$ if and only if it is a polynomial of sufficiently low degree. This gives a converse to a key lemma used by the polynomial method in algorithm design.
Our work includes a sophisticated combination of techniques from different fields, including metric embeddings, the polynomial method, and group representation theory.
Submitted 4 August, 2021; v1 submitted 23 November, 2020;
originally announced November 2020.
-
Algorithms and Hardness for Linear Algebra on Geometric Graphs
Authors:
Josh Alman,
Timothy Chu,
Aaron Schild,
Zhao Song
Abstract:
For a function $\mathsf{K} : \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}_{\geq 0}$, and a set $P = \{ x_1, \ldots, x_n\} \subset \mathbb{R}^d$ of $n$ points, the $\mathsf{K}$ graph $G_P$ of $P$ is the complete graph on $n$ nodes where the weight between nodes $i$ and $j$ is given by $\mathsf{K}(x_i, x_j)$. In this paper, we initiate the study of when efficient spectral graph theory is possible on these graphs. We investigate whether or not it is possible to solve the following problems in $n^{1+o(1)}$ time for a $\mathsf{K}$-graph $G_P$ when $d < n^{o(1)}$:
$\bullet$ Multiply a given vector by the adjacency matrix or Laplacian matrix of $G_P$
$\bullet$ Find a spectral sparsifier of $G_P$
$\bullet$ Solve a Laplacian system in $G_P$'s Laplacian matrix
For each of these problems, we consider all functions of the form $\mathsf{K}(u,v) = f(\|u-v\|_2^2)$ for a function $f:\mathbb{R} \rightarrow \mathbb{R}$. We provide algorithms and comparable hardness results for many such $\mathsf{K}$, including the Gaussian kernel, Neural tangent kernels, and more. For example, in dimension $d = Ω(\log n)$, we show that there is a parameter associated with the function $f$ for which low parameter values imply $n^{1+o(1)}$ time algorithms for all three of these problems and high parameter values imply the nonexistence of subquadratic time algorithms assuming Strong Exponential Time Hypothesis ($\mathsf{SETH}$), given natural assumptions on $f$.
As part of our results, we also show that the exponential dependence on the dimension $d$ in the celebrated fast multipole method of Greengard and Rokhlin cannot be improved, assuming $\mathsf{SETH}$, for a broad class of functions $f$. To the best of our knowledge, this is the first formal limitation proven about fast multipole methods.
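The first problem listed in this abstract, multiplying a vector by the adjacency matrix of the $\mathsf{K}$ graph, has an obvious quadratic-time baseline, sketched below for $\mathsf{K}(u,v) = f(\|u-v\|_2^2)$. The function name and the choice to leave out self-loops (a zero diagonal) are assumptions for illustration; the paper's $n^{1+o(1)}$ algorithms are not reproduced here.

```python
import math


def k_graph_matvec(points, vec, f):
    """Naive O(n^2 d) product of the K-graph adjacency matrix with vec.

    points: n points in R^d, as lists of floats
    vec:    length-n vector
    f:      function applied to the squared Euclidean distance,
            so that K(u, v) = f(||u - v||_2^2)
    Assumption: no self-loops, i.e. the diagonal of the matrix is zero.
    """
    n = len(points)
    out = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            out[i] += f(sq_dist) * vec[j]
    return out


# Example with a Gaussian kernel f(s) = exp(-s).
gaussian = lambda s: math.exp(-s)
row_sums = k_graph_matvec([[0.0], [1.0], [3.0]], [1.0, 1.0, 1.0], gaussian)
```

Multiplying by the all-ones vector, as in the example, yields the weighted degrees of the graph, which are the diagonal entries of its Laplacian.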
Submitted 4 November, 2020;
originally announced November 2020.
-
A Refined Laser Method and Faster Matrix Multiplication
Authors:
Josh Alman,
Virginia Vassilevska Williams
Abstract:
The complexity of matrix multiplication is measured in terms of $ω$, the smallest real number such that two $n\times n$ matrices can be multiplied using $O(n^{ω+ε})$ field operations for all $ε>0$; the best bound until now is $ω<2.37287$ [Le Gall'14]. All bounds on $ω$ since 1986 have been obtained using the so-called laser method, a way to lower-bound the `value' of a tensor in designing matrix multiplication algorithms. The main result of this paper is a refinement of the laser method that improves the resulting value bound for most sufficiently large tensors. Thus, even before computing any specific values, it is clear that we achieve an improved bound on $ω$, and we indeed obtain the best bound on $ω$ to date: $$ω< 2.37286.$$ The improvement is of the same magnitude as the improvement that [Le Gall'14] obtained over the previous bound [Vassilevska W.'12]. Our improvement to the laser method is quite general, and we believe it will have further applications in arithmetic complexity.
Submitted 3 September, 2024; v1 submitted 12 October, 2020;
originally announced October 2020.
-
Faster Update Time for Turnstile Streaming Algorithms
Authors:
Josh Alman,
Huacheng Yu
Abstract:
In this paper, we present a new algorithm for maintaining linear sketches in turnstile streams with faster update time. As an application, we show that $\log n$ Count sketches or CountMin sketches with a constant number of columns (i.e., buckets) can be implicitly maintained in worst-case $O(\log^{0.582} n)$ update time using $O(\log n)$ words of space, on a standard word RAM with word-size $w=Θ(\log n)$. The exponent $0.582\approx 2ω/3-1$, where $ω$ is the current matrix multiplication exponent. Due to the numerous applications of linear sketches, our algorithm improves the update time for many streaming problems in turnstile streams, in the high success probability setting, without using more space, including $\ell_2$ norm estimation, $\ell_2$ heavy hitters, point query with $\ell_1$ or $\ell_2$ error, etc. Our algorithm generalizes, with the same update time and space, to maintaining $\log n$ linear sketches, where each sketch partitions the coordinates into $k<\log^{o(1)} n$ buckets using a $c$-wise independent hash function for constant $c$, and maintains the sum of coordinates for each bucket. Moreover, if arbitrary word operations are allowed, the update time can be further improved to $O(\log^{0.187} n)$, where $0.187\approx ω/2-1$. Our update algorithm is adaptive, and it circumvents the non-adaptive cell-probe lower bounds for turnstile streaming algorithms by Larsen, Nelson and Nguyễn (STOC'15).
On the other hand, our result also shows that proving unconditional cell-probe lower bound for the update time seems very difficult, even if the space is restricted to be (nearly) the optimum. If $ω=2$, the cell-probe update time of our algorithm would be $\log^{o(1)} n$. Hence, proving any higher lower bound would imply $ω>2$.
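For context, the direct way to maintain the sketches described in this abstract touches one counter in every row on each update, so $\log n$ rows cost $O(\log n)$ time per update. The sketch below shows that baseline for a CountMin sketch; the class name, the seeded hash family $((ax + b) \bmod p) \bmod k$, and the parameter choices are illustrative assumptions, and the paper's faster implicit maintenance is not shown.

```python
import random


class CountMinSketch:
    """Baseline CountMin sketch: one counter per row is touched per update."""

    P = (1 << 61) - 1  # a Mersenne prime used as the hash modulus (assumed choice)

    def __init__(self, rows, buckets, seed=0):
        rng = random.Random(seed)
        self.buckets = buckets
        # One (a, b) pair per row for the illustrative hash x -> ((a*x + b) mod P) mod buckets.
        self.hashes = [
            (rng.randrange(1, self.P), rng.randrange(self.P)) for _ in range(rows)
        ]
        self.table = [[0] * buckets for _ in range(rows)]

    def _bucket(self, row, x):
        a, b = self.hashes[row]
        return ((a * x + b) % self.P) % self.buckets

    def update(self, x, delta):
        """Turnstile update: add delta (possibly negative) to coordinate x."""
        for row in range(len(self.table)):      # cost is linear in the number of rows
            self.table[row][self._bucket(row, x)] += delta

    def query(self, x):
        """Point query: minimum over rows of the counter that x hashes to."""
        return min(
            self.table[row][self._bucket(row, x)] for row in range(len(self.table))
        )
```

When all coordinates stay nonnegative, every counter that `x` hashes to is at least the true value of `x`, so `query` never underestimates; the loop in `update` is the per-row cost that the abstract's algorithm reduces.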
Submitted 4 November, 2019;
originally announced November 2019.
-
Limits on the Universal Method for Matrix Multiplication
Authors:
Josh Alman
Abstract:
In this work, we prove limitations on the known methods for designing matrix multiplication algorithms. Alman and Vassilevska Williams recently defined the Universal Method, which substantially generalizes all the known approaches including Strassen's Laser Method and Cohn and Umans' Group Theoretic Method. We prove concrete lower bounds on the algorithms one can design by applying the Universal Method to many different tensors. Our proofs use new tools for upper bounding the asymptotic slice rank of a wide range of tensors. Our main result is that the Universal method applied to any Coppersmith-Winograd tensor $CW_q$ cannot yield a bound on $ω$, the exponent of matrix multiplication, better than $2.16805$. By comparison, it was previously only known that the weaker `Galactic Method' applied to $CW_q$ could not achieve an exponent of $2$.
We also study the Laser Method (which is, in principle, a highly special case of the Universal Method) and prove that it is "complete" for matrix multiplication algorithms: when it applies to a tensor $T$, it achieves $ω= 2$ if and only if it is possible for the Universal method applied to $T$ to achieve $ω= 2$. Hence, the Laser Method, which was originally used as an algorithmic tool, can also be seen as a lower bounding tool. For example, in their landmark paper, Coppersmith and Winograd achieved a bound of $ω\leq 2.376$, by applying the Laser Method to $CW_q$. By our result, the fact that they did not achieve $ω=2$ implies a lower bound on the Universal Method applied to $CW_q$. Indeed, if it were possible for the Universal Method applied to $CW_q$ to achieve $ω=2$, then Coppersmith and Winograd's application of the Laser Method would have achieved $ω=2$.
Submitted 1 May, 2019; v1 submitted 20 December, 2018;
originally announced December 2018.
-
Limits on All Known (and Some Unknown) Approaches to Matrix Multiplication
Authors:
Josh Alman,
Virginia Vassilevska Williams
Abstract:
We study the known techniques for designing Matrix Multiplication algorithms. The two main approaches are the Laser method of Strassen, and the Group theoretic approach of Cohn and Umans. We define a generalization based on zeroing outs which subsumes these two approaches, which we call the Solar method, and an even more general method based on monomial degenerations, which we call the Galactic method.
We then design a suite of techniques for proving lower bounds on the value of $ω$, the exponent of matrix multiplication, which can be achieved by algorithms using many tensors $T$ and the Galactic method. Some of our techniques exploit `local' properties of $T$, like finding a sub-tensor of $T$ which is so `weak' that $T$ itself couldn't be used to achieve a good bound on $ω$, while others exploit `global' properties, like $T$ being a monomial degeneration of the structural tensor of a group algebra.
Our main result is that there is a universal constant $\ell>2$ such that a large class of tensors generalizing the Coppersmith-Winograd tensor $CW_q$ cannot be used within the Galactic method to show a bound on $ω$ better than $\ell$, for any $q$. We give evidence that previous lower-bounding techniques were not strong enough to show this. We also prove a number of complementary results along the way, including that for any group $G$, the structural tensor of $\mathbb{C}[G]$ can be used to recover the best bound on $ω$ which the Coppersmith-Winograd approach gets using $CW_{|G|-2}$ as long as the asymptotic rank of the structural tensor is not too large.
Submitted 19 October, 2018;
originally announced October 2018.
-
An Illuminating Algorithm for the Light Bulb Problem
Authors:
Josh Alman
Abstract:
The Light Bulb Problem is one of the most basic problems in data analysis. One is given as input $n$ vectors in $\{-1,1\}^d$, which are all independently and uniformly random, except for a planted pair of vectors with inner product at least $ρ\cdot d$ for some constant $ρ> 0$. The task is to find the planted pair. The most straightforward algorithm leads to a runtime of $Ω(n^2)$. Algorithms based on techniques like Locality-Sensitive Hashing achieve runtimes of $n^{2 - O(ρ)}$; as $ρ$ gets small, these approach quadratic.
Building on prior work, we give a new algorithm for this problem which runs in time $O(n^{1.582} + nd),$ regardless of how small $ρ$ is. This matches the best known runtime due to Karppa et al. Our algorithm combines techniques from previous work on the Light Bulb Problem with the so-called `polynomial method in algorithm design,' and has a simpler analysis than previous work. Our algorithm is also easily derandomized, leading to a deterministic algorithm for the Light Bulb Problem with the same runtime of $O(n^{1.582} + nd),$ improving previous results.
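The "most straightforward algorithm" referred to in this abstract can be sketched as a scan over all pairs, which is the quadratic baseline that the $O(n^{1.582} + nd)$ algorithm improves on. The function name and the input representation (lists of $\pm 1$ entries) are illustrative assumptions.

```python
def find_planted_pair(vectors, rho):
    """Brute-force Light Bulb search over all pairs of vectors.

    vectors: list of n vectors in {-1, 1}^d
    rho:     correlation parameter; a pair is reported when its inner
             product is at least rho * d
    Returns the first qualifying pair of indices (i, j), or None.
    """
    d = len(vectors[0])
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):    # n(n-1)/2 pairs in total
            inner = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            if inner >= rho * d:
                return (i, j)
    return None
```

Each pair costs $d$ multiplications, so this baseline takes on the order of $n^2 d$ time regardless of $ρ$.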
△ Less
Submitted 15 October, 2018;
originally announced October 2018.
-
Further limitations of the known approaches for matrix multiplication
Authors:
Josh Alman,
Virginia Vassilevska Williams
Abstract:
We consider the techniques behind the current best algorithms for matrix multiplication. Our results are threefold.
(1) We provide a unifying framework, showing that all known matrix multiplication running times since 1986 can be achieved from a single very natural tensor - the structural tensor $T_q$ of addition modulo an integer $q$.
(2) We show that if one applies a generalization of the known techniques (arbitrary zeroing out of tensor powers to obtain independent matrix products in order to use the asymptotic sum inequality of Schönhage) to an arbitrary monomial degeneration of $T_q$, then there is an explicit lower bound, depending on $q$, on the bound on the matrix multiplication exponent $ω$ that one can achieve. We also show upper bounds on the value $α$ that one can achieve, where $α$ is such that $n\times n^α\times n$ matrix multiplication can be computed in $n^{2+o(1)}$ time.
(3) We show that our lower bound on $ω$ approaches $2$ as $q$ goes to infinity. This suggests a promising approach to improving the bound on $ω$: for variable $q$, find a monomial degeneration of $T_q$ which, using the known techniques, produces an upper bound on $ω$ as a function of $q$. Then, take $q$ to infinity. It is not ruled out, and hence possible, that one can obtain $ω=2$ in this way.
Submitted 19 December, 2017;
originally announced December 2017.
-
Dynamic Parameterized Problems and Algorithms
Authors:
Josh Alman,
Matthias Mnich,
Virginia Vassilevska Williams
Abstract:
Fixed-parameter algorithms and kernelization are two powerful methods to solve $\mathsf{NP}$-hard problems. Yet, so far those algorithms have been largely restricted to static inputs.
In this paper we provide fixed-parameter algorithms and kernelizations for fundamental $\mathsf{NP}$-hard problems with dynamic inputs. We consider a variety of parameterized graph and hitting set problems which are known to have $f(k)n^{1+o(1)}$ time algorithms on inputs of size $n$, and we consider the question of whether there is a data structure that supports small updates (such as edge/vertex/set/element insertions and deletions) with an update time of $g(k)n^{o(1)}$; such an update time would be essentially optimal. Update and query times independent of $n$ are particularly desirable. Among many other results, we show that Feedback Vertex Set and $k$-Path admit dynamic algorithms with $f(k)\log^{O(1)}n$ update and query times for some function $f$ depending on the solution size $k$ only.
We complement our positive results by several conditional and unconditional lower bounds. For example, we show that unlike their undirected counterparts, Directed Feedback Vertex Set and Directed $k$-Path do not admit dynamic algorithms with $n^{o(1)}$ update and query times even for constant solution sizes $k\leq 3$, assuming popular hardness hypotheses. We also show that unconditionally, in the cell probe model, Directed Feedback Vertex Set cannot be solved with update time that is purely a function of $k$.
Submitted 2 July, 2017;
originally announced July 2017.
-
Cell-Probe Lower Bounds from Online Communication Complexity
Authors:
Josh Alman,
Joshua R. Wang,
Huacheng Yu
Abstract:
In this work, we introduce an online model for communication complexity. Analogous to how online algorithms receive their input piece-by-piece, our model presents one of the players, Bob, his input piece-by-piece, and has the players Alice and Bob cooperate to compute a result each time before the next piece is revealed to Bob. This model has a closer and more natural correspondence to dynamic data structures than classic communication models do, and hence presents a new perspective on data structures.
We first present a tight lower bound for the online set intersection problem in the online communication model, demonstrating a general approach for proving online communication lower bounds. The online communication model prevents a batching trick that classic communication complexity allows, and yields a stronger lower bound. We then apply the online communication model to prove data structure lower bounds for two dynamic data structure problems: the Group Range problem and the Dynamic Connectivity problem for forests. Both of the problems admit a worst case $O(\log n)$-time data structure. Using online communication complexity, we prove a tight cell-probe lower bound for each: spending $o(\log n)$ (even amortized) time per operation results in at best an $\exp(-δ^2 n)$ probability of correctly answering a $(1/2+δ)$-fraction of the $n$ queries.
Submitted 15 November, 2017; v1 submitted 20 April, 2017;
originally announced April 2017.
-
Probabilistic Rank and Matrix Rigidity
Authors:
Josh Alman,
Ryan Williams
Abstract:
We consider a notion of probabilistic rank and probabilistic sign-rank of a matrix, which measures the extent to which a matrix can be probabilistically represented by low-rank matrices. We demonstrate several connections with matrix rigidity, communication complexity, and circuit lower bounds, including:
The Walsh-Hadamard Transform is Not Very Rigid. We give surprising upper bounds on the rigidity of a family of matrices whose rigidity has been extensively studied and which were conjectured to be highly rigid. For the $2^n \times 2^n$ Walsh-Hadamard transform $H_n$ (a.k.a. Sylvester matrices, or the communication matrix of Inner Product mod 2), we show how to modify only $2^{εn}$ entries in each row and make the rank drop below $2^{n(1-Ω(ε^2/\log(1/ε)))}$, for all $ε > 0$, over any field. That is, it is not possible to prove arithmetic circuit lower bounds on Hadamard matrices via L. Valiant's matrix rigidity approach. We also show non-trivial rigidity upper bounds for $H_n$ with smaller target rank.
Matrix Rigidity and Threshold Circuit Lower Bounds. We give new consequences of rigid matrices for Boolean circuit complexity. We show that explicit $n \times n$ Boolean matrices which maintain rank at least $2^{(\log n)^{1-δ}}$ after $n^2/2^{(\log n)^{δ/2}}$ modified entries would yield a function lacking sub-quadratic-size $AC^0$ circuits with two layers of arbitrary linear threshold gates. We also prove that explicit 0/1 matrices over $\mathbb{R}$ which are modestly more rigid than the best known rigidity lower bounds for sign-rank would imply strong lower bounds for the infamously difficult class $THR\circ THR$.
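For readers unfamiliar with the object studied in the first result, the following is a minimal sketch of the $2^n \times 2^n$ Walsh-Hadamard matrix $H_n$, built by the standard Sylvester doubling construction, together with a check that its entries agree with the communication matrix of Inner Product mod 2. It only illustrates the definition and does not implement the rigidity upper bound from the abstract; the function names are hypothetical and chosen for illustration.

```python
def sylvester_hadamard(n):
    """Return H_n as a list of rows with entries +1/-1 (size 2^n x 2^n)."""
    h = [[1]]
    for _ in range(n):
        # Sylvester doubling step: H_{k+1} = [[H_k, H_k], [H_k, -H_k]].
        top = [row + row for row in h]
        bottom = [row + [-v for v in row] for row in h]
        h = top + bottom
    return h


def inner_product_mod2(x, y):
    """Inner product mod 2 of the bit representations of integers x and y."""
    return bin(x & y).count("1") % 2


def matches_inner_product(n):
    """Check that H_n[x][y] == (-1)^<x, y> for all row/column indices x, y."""
    h = sylvester_hadamard(n)
    size = 1 << n
    return all(
        h[x][y] == (-1) ** inner_product_mod2(x, y)
        for x in range(size)
        for y in range(size)
    )
```

Calling `matches_inner_product(n)` for small `n` confirms that the two descriptions of $H_n$ given in the abstract (Sylvester matrix and Inner Product mod 2 communication matrix) coincide entry by entry.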
Submitted 7 January, 2017; v1 submitted 16 November, 2016;
originally announced November 2016.
-
Polynomial Representations of Threshold Functions and Algorithmic Applications
Authors:
Josh Alman,
Timothy M. Chan,
Ryan Williams
Abstract:
We design new polynomials for representing threshold functions in three different regimes: probabilistic polynomials of low degree, which need far less randomness than previous constructions, polynomial threshold functions (PTFs) with "nice" threshold behavior and degree almost as low as the probabilistic polynomials, and a new notion of probabilistic PTFs where we combine the above techniques to achieve even lower degree with similar "nice" threshold behavior. Utilizing these polynomial constructions, we design faster algorithms for a variety of problems:
$\bullet$ Offline Hamming Nearest (and Furthest) Neighbors: Given $n$ red and $n$ blue points in $d$-dimensional Hamming space for $d=c\log n$, we can find an (exact) nearest (or furthest) blue neighbor for every red point in randomized time $n^{2-1/O(\sqrt{c}\log^{2/3}c)}$ or deterministic time $n^{2-1/O(c\log^2c)}$. These also lead to faster MAX-SAT algorithms for sparse CNFs.
$\bullet$ Offline Approximate Nearest (and Furthest) Neighbors: Given $n$ red and $n$ blue points in $d$-dimensional $\ell_1$ or Euclidean space, we can find a $(1+ε)$-approximate nearest (or furthest) blue neighbor for each red point in randomized time near $dn+n^{2-Ω(ε^{1/3}/\log(1/ε))}$.
$\bullet$ SAT Algorithms and Lower Bounds for Circuits With Linear Threshold Functions: We give a satisfiability algorithm for $AC^0[m]\circ LTF\circ LTF$ circuits with a subquadratic number of linear threshold gates on the bottom layer, and a subexponential number of gates on the other layers, that runs in deterministic $2^{n-n^ε}$ time. This also implies new circuit lower bounds for threshold circuits. We also give a randomized $2^{n-n^ε}$-time SAT algorithm for subexponential-size $MAJ\circ AC^0\circ LTF\circ AC^0\circ LTF$ circuits, where the top $MAJ$ gate and middle $LTF$ gates have $O(n^{6/5-δ})$ fan-in.
Submitted 15 August, 2016;
originally announced August 2016.
-
Probabilistic Polynomials and Hamming Nearest Neighbors
Authors:
Josh Alman,
Ryan Williams
Abstract:
We show how to compute any symmetric Boolean function on $n$ variables over any field (as well as the integers) with a probabilistic polynomial of degree $O(\sqrt{n \log(1/ε)})$ and error at most $ε$. The degree dependence on $n$ and $ε$ is optimal, matching a lower bound of Razborov (1987) and Smolensky (1987) for the MAJORITY function. The proof is constructive: a low-degree polynomial can be efficiently sampled from the distribution.
This polynomial construction is combined with other algebraic ideas to give the first subquadratic time algorithm for computing a (worst-case) batch of Hamming distances in superlogarithmic dimensions, exactly. To illustrate, let $c(n) : \mathbb{N} \rightarrow \mathbb{N}$. Suppose we are given a database $D$ of $n$ vectors in $\{0,1\}^{c(n) \log n}$ and a collection of $n$ query vectors $Q$ in the same dimension. For all $u \in Q$, we wish to compute a $v \in D$ with minimum Hamming distance from $u$. We solve this problem in $n^{2-1/O(c(n) \log^2 c(n))}$ randomized time. Hence, the problem is in "truly subquadratic" time for $O(\log n)$ dimensions, and in subquadratic time for $d = o((\log^2 n)/(\log \log n)^2)$. We apply the algorithm to computing pairs with maximum inner product, closest pair in $\ell_1$ for vectors with bounded integer entries, and pairs with maximum Jaccard coefficients.
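To make the batch problem statement concrete, here is a minimal sketch of the naive quadratic-time baseline: for every query $u \in Q$ it scans the whole database $D$ and returns a vector of minimum Hamming distance. This is not the subquadratic algorithm described in the abstract, only the straightforward method it improves upon; the function names are hypothetical, and vectors are assumed to be equal-length tuples of 0/1 values.

```python
def hamming_distance(u, v):
    """Number of coordinates in which the equal-length 0/1 vectors u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def batch_nearest_neighbors(database, queries):
    """For each query, return the index of a database vector at minimum
    Hamming distance. Ties go to the smallest index. With n database
    vectors and n queries in d dimensions this takes O(n^2 * d) time."""
    return [
        min(range(len(database)), key=lambda i: hamming_distance(database[i], u))
        for u in queries
    ]
```

With `database = [(0, 0, 0, 0), (1, 1, 1, 1), (1, 0, 1, 0)]` and `queries = [(0, 0, 0, 1), (1, 1, 1, 0), (1, 0, 0, 0)]`, the call `batch_nearest_neighbors(database, queries)` returns `[0, 1, 0]`.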
Submitted 17 July, 2015;
originally announced July 2015.