-
DartMinHash: Fast Sketching for Weighted Sets
Authors:
Tobias Christiani
Abstract:
Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines. We introduce a simple algorithm that takes a weighted set $x \in \mathbb{R}_{\geq 0}^{d}$ and computes $k$ independent minhashes in expected time $O(k \log k + \Vert x \Vert_{0}\log( \Vert x \Vert_1 + 1/\Vert x \Vert_1))$, improving upon the state-of-the-art BagMinHash algorithm (KDD '18) and representing the fastest weighted minhash algorithm for sparse data. Our experiments show running times that scale better with $k$ and $\Vert x \Vert_0$ compared to ICWS (ICDM '10) and BagMinHash, obtaining $10$x speedups in common use cases. Our approach also gives rise to a technique for computing fully independent locality-sensitive hash values for $(L, K)$-parameterized approximate near neighbor search under weighted Jaccard similarity in optimal expected time $O(LK + \Vert x \Vert_0)$, improving on prior work even in the case of unweighted sets.
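For orientation, the quantity that weighted minhashes estimate can be illustrated with a naive baseline. The sketch below is not the DartMinHash algorithm: it assumes positive integer weights, expands each element $i$ of weight $w$ into $w$ copies, and takes $O(k \Vert x \Vert_1)$ time, which is the kind of cost the abstract's bound avoids. The function names and the use of BLAKE2b as a stand-in for a random hash function are illustrative assumptions.

```python
import hashlib


def weighted_jaccard(x, y):
    """Weighted Jaccard similarity: sum of minima over sum of maxima."""
    keys = set(x) | set(y)
    num = sum(min(x.get(i, 0), y.get(i, 0)) for i in keys)
    den = sum(max(x.get(i, 0), y.get(i, 0)) for i in keys)
    return num / den if den > 0 else 0.0


def _hash(seed, i, j):
    # Stand-in for a random hash function (assumption: BLAKE2b is "random enough").
    data = f"{seed}:{i}:{j}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def naive_weighted_minhash(x, k):
    """k minhashes of a non-empty weighted set with positive integer weights.

    Each element i with weight w is expanded into copies (i, 0), ..., (i, w - 1);
    the minhash under a given seed is the copy with the smallest hash value.
    """
    sketch = []
    for seed in range(k):
        best = min((_hash(seed, i, j), i, j) for i, w in x.items() for j in range(w))
        sketch.append(best[1:])
    return sketch


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of coordinates on which two sketches agree."""
    return sum(u == v for u, v in zip(sketch_a, sketch_b)) / len(sketch_a)
```

With `x = {"a": 3, "b": 1}` and `y = {"a": 1, "b": 2, "c": 1}`, `weighted_jaccard(x, y)` is 2/6, and `estimate_similarity` applied to the two sketches approximates that value as `k` grows.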
Submitted 23 May, 2020;
originally announced May 2020.
-
PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors
Authors:
Martin Aumüller,
Tobias Christiani,
Rasmus Pagh,
Michael Vesterli
Abstract:
We present PUFFINN, a parameterless LSH-based index for solving the $k$-nearest neighbor problem with probabilistic guarantees. By parameterless we mean that the user is only required to specify the amount of memory the index is supposed to use and the result quality that should be achieved. The index combines several heuristic ideas known in the literature. Through small adaptations to the query algorithm, we make these heuristics rigorous. We perform experiments on real-world and synthetic inputs to evaluate implementation choices and show that the implementation satisfies the quality guarantees while being competitive with other state-of-the-art approaches to nearest neighbor search.
We describe a novel synthetic data set that is difficult to solve for almost all existing nearest neighbor search approaches, and for which PUFFINN significantly outperforms previous methods.
Submitted 28 June, 2019;
originally announced June 2019.
-
Algorithms for Similarity Search and Pseudorandomness
Authors:
Tobias Christiani
Abstract:
We study the problem of approximate near neighbor (ANN) search and show the following results:
- An improved framework for solving the ANN problem using locality-sensitive hashing, reducing the number of evaluations of locality-sensitive hash functions and the word-RAM complexity compared to the standard framework.
- A framework for solving the ANN problem with space-time tradeoffs as well as tight upper and lower bounds for the space-time tradeoff of framework solutions to the ANN problem under cosine similarity.
- A novel approach to solving the ANN problem on sets along with a matching lower bound, improving the state of the art.
- A self-tuning version of the algorithm is shown through experiments to outperform existing similarity join algorithms.
- Tight lower bounds for asymmetric locality-sensitive hashing which has applications to the approximate furthest neighbor problem, orthogonal vector search, and annulus queries.
- A proof of the optimality of a well-known Boolean locality-sensitive hashing scheme.
We study the problem of efficient algorithms for producing high-quality pseudorandom numbers and obtain the following results:
- A deterministic algorithm for generating pseudorandom numbers of arbitrarily high quality in constant time using near-optimal space.
- A randomized construction of a family of hash functions that outputs pseudorandom numbers of arbitrarily high quality with space usage and running time nearly matching known cell-probe lower bounds.
Submitted 22 June, 2019;
originally announced June 2019.
-
Confirmation Sampling for Exact Nearest Neighbor Search
Authors:
Tobias Christiani,
Rasmus Pagh,
Mikkel Thorup
Abstract:
Locality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC '98, has been an extremely influential framework for nearest neighbor search in high-dimensional data sets. While theoretical work has focused on the approximate nearest neighbor problems, in practice LSH data structures with suitably chosen parameters are used to solve the exact nearest neighbor problem (with some error probability). Sublinear query time is often possible in practice even for exact nearest neighbor search, intuitively because the nearest neighbor tends to be significantly closer than other data points. However, theory offers little advice on how to choose LSH parameters outside of pre-specified worst-case settings.
We introduce the technique of confirmation sampling for solving the exact nearest neighbor problem using LSH. First, we give a general reduction that transforms a sequence of data structures that each find the nearest neighbor with a small, unknown probability, into a data structure that returns the nearest neighbor with probability $1-δ$, using as few queries as possible. Second, we present a new query algorithm for the LSH Forest data structure with $L$ trees that is able to return the exact nearest neighbor of a query point within the same time bound as an LSH Forest of $Ω(L)$ trees with internal parameters specifically tuned to the query and data.
Submitted 6 December, 2018;
originally announced December 2018.
-
Optimal Boolean Locality-Sensitive Hashing
Authors:
Tobias Christiani
Abstract:
For $0 \leq β< α< 1$ the distribution $\mathcal{H}$ over Boolean functions $h \colon \{-1, 1\}^d \to \{-1, 1\}$ that minimizes the expression \begin{equation*}
ρ_{α, β} = \frac{\log(1/\Pr_{\substack{h \sim \mathcal{H} \\ (x, y) \text{ $α$-corr.}}}[h(x) = h(y)])}{\log(1/\Pr_{\substack{h \sim \mathcal{H} \\ (x, y) \text{ $β$-corr.}}}[h(x) = h(y)])} \end{equation*} assigns nonzero probability only to members of the set of dictator functions $h(x) = \pm x_i$.
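As a worked instance, assume the usual convention that an $α$-correlated pair satisfies $\Pr[x_i = y_i] = (1+α)/2$ in every coordinate (an assumption, since the abstract does not restate the definition). A dictator function $h(x) = \pm x_i$ collides exactly when $x_i = y_i$, so the exponent it attains would be \begin{equation*}
ρ_{α, β} = \frac{\log(2/(1+α))}{\log(2/(1+β))}. \end{equation*} For example, $α = 1/2$ and $β = 0$ give $\log(4/3)/\log 2 \approx 0.415$.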
Submitted 4 December, 2018;
originally announced December 2018.
-
Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search
Authors:
Tobias Christiani
Abstract:
The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a general technique for constructing a data structure to answer approximate near neighbor queries by using a distribution $\mathcal{H}$ over locality-sensitive hash functions that partition space. For a collection of $n$ points, after preprocessing, the query time is dominated by $O(n^ρ \log n)$ evaluations of hash functions from $\mathcal{H}$ and $O(n^ρ)$ hash table lookups and distance computations where $ρ\in (0,1)$ is determined by the locality-sensitivity properties of $\mathcal{H}$. It follows from a recent result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive hash functions can be reduced to $O(\log^2 n)$, leaving the query time to be dominated by $O(n^ρ)$ distance computations and $O(n^ρ \log n)$ additional word-RAM operations. We state this result as a general framework and provide a simpler analysis showing that the number of lookups and distance computations closely matches that of the Indyk-Motwani framework, making it a viable replacement in practice. Using ideas from another locality-sensitive hashing framework by Andoni and Indyk (SODA 2006) we are able to reduce the number of additional word-RAM operations to $O(n^ρ)$.
Submitted 16 February, 2018; v1 submitted 24 August, 2017;
originally announced August 2017.
-
Scalable and robust set similarity join
Authors:
Tobias Christiani,
Rasmus Pagh,
Johan Sivertsen
Abstract:
Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important --- indeed, where the exact set similarity join is itself only an approximation of the desired result set.
We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The present state-of-the-art exact methods are based on prefix-filtering, the performance of which depends on the data set having many rare tokens. Our method is robust against the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than state-of-the-art exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in high-dimensional sketching and indexing that we believe to be of wider relevance to the data engineering community.
Submitted 2 March, 2018; v1 submitted 21 July, 2017;
originally announced July 2017.
-
Distance-Sensitive hashing
Authors:
Martin Aumüller,
Tobias Christiani,
Rasmus Pagh,
Francesco Silvestri
Abstract:
Locality-sensitive hashing (LSH) is an important tool for managing high-dimensional noisy or uncertain data, for example in connection with data cleaning (similarity join) and noise-robust search (similarity search). However, for a number of problems the LSH framework is not known to yield good solutions, and instead ad hoc solutions have been designed for particular similarity and distance measures. For example, this is true for output-sensitive similarity search/join, and for indexes supporting annulus queries that aim to report a point close to a certain given distance from the query point.
In this paper we initiate the study of distance-sensitive hashing (DSH), a generalization of LSH that seeks a family of hash functions such that the probability of two points having the same hash value is a given function of the distance between them. More precisely, given a distance space $(X, \text{dist})$ and a "collision probability function" (CPF) $f\colon \mathbb{R}\rightarrow [0,1]$ we seek a distribution over pairs of functions $(h,g)$ such that for every pair of points $x, y \in X$ the collision probability is $\Pr[h(x)=g(y)] = f(\text{dist}(x,y))$. Locality-sensitive hashing is the study of how fast a CPF can decrease as the distance grows. For many spaces, $f$ can be made exponentially decreasing even if we restrict attention to the symmetric case where $g=h$. We show that the asymmetry achieved by having a pair of functions makes it possible to achieve CPFs that are, for example, increasing or unimodal, and show how this leads to principled solutions to problems not addressed by the LSH framework. This includes a novel application to privacy-preserving distance estimation. We believe that the DSH framework will find further applications in high-dimensional data management.
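A small illustration of how asymmetry can produce an increasing CPF, under assumptions not taken from the abstract: points are bit vectors in $\{0,1\}^d$ under Hamming distance, $h$ samples a uniformly random coordinate, and $g$ samples the same coordinate but negates it. The symmetric pair $(h, h)$ then collides with probability $1 - \text{dist}(x,y)/d$, while the asymmetric pair $(h, g)$ collides with probability $\text{dist}(x,y)/d$. The function names are hypothetical, and the sketch computes the probabilities exactly by enumerating coordinates rather than sampling.

```python
def hamming(x, y):
    """Number of coordinates in which two equal-length bit vectors differ."""
    return sum(a != b for a, b in zip(x, y))


def collision_probability(x, y, negate_g):
    """Exact Pr[h(x) = g(y)] over a uniformly random coordinate i.

    h(x) = x[i]; g(y) = y[i] if negate_g is False, else 1 - y[i].
    """
    d = len(x)
    hits = 0
    for i in range(d):
        hx = x[i]
        gy = 1 - y[i] if negate_g else y[i]
        hits += hx == gy
    return hits / d


x = [0, 1, 1, 0, 1]
y = [0, 0, 1, 0, 1]
symmetric = collision_probability(x, y, negate_g=False)   # decreasing in distance
asymmetric = collision_probability(x, y, negate_g=True)   # increasing in distance
```

Here `hamming(x, y)` is 1 with $d = 5$, so the symmetric probability is $4/5$ and the asymmetric one is $1/5$.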
Submitted 17 April, 2018; v1 submitted 22 March, 2017;
originally announced March 2017.
-
Set Similarity Search Beyond MinHash
Authors:
Tobias Christiani,
Rasmus Pagh
Abstract:
We consider the problem of approximate set similarity search under Braun-Blanquet similarity $B(\mathbf{x}, \mathbf{y}) = |\mathbf{x} \cap \mathbf{y}| / \max(|\mathbf{x}|, |\mathbf{y}|)$. The $(b_1, b_2)$-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets $P$ such that, given a query set $\mathbf{q}$, if there exists $\mathbf{x} \in P$ with $B(\mathbf{q}, \mathbf{x}) \geq b_1$, then we can efficiently return $\mathbf{x}' \in P$ with $B(\mathbf{q}, \mathbf{x}') > b_2$.
We present a simple data structure that solves this problem with space usage $O(n^{1+ρ}\log n + \sum_{\mathbf{x} \in P}|\mathbf{x}|)$ and query time $O(|\mathbf{q}|n^ρ \log n)$ where $n = |P|$ and $ρ= \log(1/b_1)/\log(1/b_2)$. Making use of existing lower bounds for locality-sensitive hashing by O'Donnell et al. (TOCT 2014) we show that this value of $ρ$ is tight across the parameter space, i.e., for every choice of constants $0 < b_2 < b_1 < 1$.
In the case where all sets have the same size our solution strictly improves upon the value of $ρ$ that can be obtained through the use of state-of-the-art data-independent techniques in the Indyk-Motwani locality-sensitive hashing framework (STOC 1998) such as Broder's MinHash (CCS 1997) for Jaccard similarity and Andoni et al.'s cross-polytope LSH (NIPS 2015) for cosine similarity. Surprisingly, even though our solution is data-independent, for a large part of the parameter space we outperform the currently best data-dependent method by Andoni and Razenshteyn (STOC 2015).
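The definitions above can be made concrete with a short sketch. It computes Braun-Blanquet similarity and the exponent $ρ = \log(1/b_1)/\log(1/b_2)$ from the abstract, and compares it with the standard MinHash exponent in the equal-size case, where Jaccard similarity relates to Braun-Blanquet similarity as $J = B/(2 - B)$. The function names are illustrative, and the convention of returning 0 for two empty sets is an assumption.

```python
import math


def braun_blanquet(x, y):
    """B(x, y) = |x ∩ y| / max(|x|, |y|) for Python sets."""
    if not x and not y:
        return 0.0  # assumed convention for two empty sets
    return len(x & y) / max(len(x), len(y))


def rho_braun_blanquet(b1, b2):
    """Exponent from the abstract, for thresholds 0 < b2 < b1 < 1."""
    return math.log(1.0 / b1) / math.log(1.0 / b2)


def rho_minhash_equal_size(b1, b2):
    """MinHash exponent log(1/j1)/log(1/j2) when all sets have the same size.

    For equal-size sets, Jaccard similarity is J = B / (2 - B).
    """
    j1 = b1 / (2.0 - b1)
    j2 = b2 / (2.0 - b2)
    return math.log(1.0 / j1) / math.log(1.0 / j2)


b1, b2 = 0.5, 0.25
rho_bb = rho_braun_blanquet(b1, b2)
rho_mh = rho_minhash_equal_size(b1, b2)
```

For $b_1 = 1/2$ and $b_2 = 1/4$ the first exponent is $1/2$, while the MinHash exponent is $\log 3 / \log 7$, which is larger.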
Submitted 18 April, 2017; v1 submitted 22 December, 2016;
originally announced December 2016.
-
A Framework for Similarity Search with Space-Time Tradeoffs using Locality-Sensitive Filtering
Authors:
Tobias Christiani
Abstract:
We present a framework for similarity search based on Locality-Sensitive Filtering (LSF), generalizing the Indyk-Motwani (STOC 1998) Locality-Sensitive Hashing (LSH) framework to support space-time tradeoffs. Given a family of filters, defined as a distribution over pairs of subsets of space with certain locality-sensitivity properties, we can solve the approximate near neighbor problem in $d$-dimensional space for an $n$-point data set with query time $dn^{ρ_q+o(1)}$, update time $dn^{ρ_u+o(1)}$, and space usage $dn + n^{1 + ρ_u + o(1)}$. The space-time tradeoff is tied to the tradeoff between query time and update time, controlled by the exponents $ρ_q, ρ_u$ that are determined by the filter family. Locality-sensitive filtering was introduced by Becker et al. (SODA 2016) together with a framework yielding a single, balanced, tradeoff between query time and space, further relying on the assumption of an efficient oracle for the filter evaluation algorithm. We extend the LSF framework to support space-time tradeoffs and through a combination of existing techniques we remove the oracle assumption.
Building on a filter family for the unit sphere by Laarhoven (arXiv 2015) we use a kernel embedding technique by Rahimi & Recht (NIPS 2007) to show a solution to the $(r,cr)$-near neighbor problem in $\ell_s^d$-space for $0 < s \leq 2$ with query and update exponents $ρ_q=\frac{c^s(1+λ)^2}{(c^s+λ)^2}$ and $ρ_u=\frac{c^s(1-λ)^2}{(c^s+λ)^2}$ where $λ\in[-1,1]$ is a tradeoff parameter. This result improves upon the space-time tradeoff of Kapralov (PODS 2015) and is shown to be optimal in the case of a balanced tradeoff. Finally, we show a lower bound for the space-time tradeoff on the unit sphere that matches Laarhoven's and our own upper bound in the case of random data.
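To see how the tradeoff parameter moves the exponents, the two formulas above can be evaluated directly. This is a minimal sketch with a hypothetical function name; it only restates the expressions for $ρ_q$ and $ρ_u$ and checks the parameter ranges given in the abstract, with $c > 1$ as an added assumption.

```python
def lsf_exponents(c, s, lam):
    """Return (rho_q, rho_u) for approximation factor c, norm parameter s, tradeoff lam."""
    if not (c > 1 and 0 < s <= 2 and -1 <= lam <= 1):
        raise ValueError("expected c > 1, 0 < s <= 2 and -1 <= lam <= 1")
    cs = c ** s
    denom = (cs + lam) ** 2
    rho_q = cs * (1 + lam) ** 2 / denom
    rho_u = cs * (1 - lam) ** 2 / denom
    return rho_q, rho_u


# Balanced tradeoff (lam = 0): both exponents equal 1 / c^s.
balanced = lsf_exponents(2.0, 2, 0.0)
# Extreme settings push one of the two exponents to zero.
at_plus_one = lsf_exponents(2.0, 2, 1.0)
at_minus_one = lsf_exponents(2.0, 2, -1.0)
```

At $λ = 0$ both exponents equal $1/c^s$, and at $λ = \pm 1$ one of the two exponents vanishes while the other grows.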
Submitted 22 November, 2016; v1 submitted 9 May, 2016;
originally announced May 2016.
-
From Independence to Expansion and Back Again
Authors:
Tobias Christiani,
Rasmus Pagh,
Mikkel Thorup
Abstract:
We consider the following fundamental problems: (1) Constructing $k$-independent hash functions with a space-time tradeoff close to Siegel's lower bound. (2) Constructing representations of unbalanced expander graphs having small size and allowing fast computation of the neighbor function. It is not hard to show that these problems are intimately connected in the sense that a good solution to one of them leads to a good solution to the other one. In this paper we exploit this connection to present efficient, recursive constructions of $k$-independent hash functions (and hence expanders with a small representation). While the previously most efficient construction (Thorup, FOCS 2013) needed time quasipolynomial in Siegel's lower bound, our time bound is just a logarithmic factor from the lower bound.
Submitted 11 June, 2015;
originally announced June 2015.
-
Generating k-independent variables in constant time
Authors:
Tobias Christiani,
Rasmus Pagh
Abstract:
The generation of pseudorandom elements over finite fields is fundamental to the time, space and randomness complexity of randomized algorithms and data structures. We consider the problem of generating $k$-independent random values over a finite field $\mathbb{F}$ in a word RAM model equipped with constant time addition and multiplication in $\mathbb{F}$, and present the first nontrivial construction of a generator that outputs each value in constant time, not dependent on $k$. Our generator has period length $|\mathbb{F}|\,\mbox{poly} \log k$ and uses $k\,\mbox{poly}(\log k) \log |\mathbb{F}|$ bits of space, which is optimal up to a $\mbox{poly} \log k$ factor. We are able to bypass Siegel's lower bound on the time-space tradeoff for $k$-independent functions by a restriction to sequential evaluation.
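For contrast, the classic textbook way to obtain $k$-independent values over a prime field is a random polynomial of degree $k - 1$, which costs $O(k)$ field operations per value rather than constant time. The sketch below shows that baseline only, not the generator from the paper; the function names and the choice of the prime 101 are illustrative assumptions.

```python
import random


def sample_polynomial(k, p, rng):
    """Draw k uniform coefficients in F_p (p prime): a random polynomial of degree < k."""
    return [rng.randrange(p) for _ in range(k)]


def evaluate(coeffs, x, p):
    """Evaluate coeffs[0] + coeffs[1]*x + ... mod p with Horner's rule, O(k) time."""
    acc = 0
    for a in reversed(coeffs):
        acc = (acc * x + a) % p
    return acc


p = 101                      # small prime chosen for illustration
rng = random.Random(0)       # seeded only to make the example reproducible
coeffs = sample_polynomial(4, p, rng)
# Values at distinct points 0, 1, ..., p - 1 are 4-independent over F_p.
values = [evaluate(coeffs, x, p) for x in range(p)]
```

Each call to `evaluate` touches all $k$ coefficients, which is the per-value cost that a constant-time generator avoids.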
Submitted 9 August, 2014;
originally announced August 2014.