Showing 1–12 of 12 results for author: Christiani, T

Searching in archive cs.
  1. arXiv:2005.11547  [pdf, other]

    cs.DS cs.IR cs.LG

    DartMinHash: Fast Sketching for Weighted Sets

    Authors: Tobias Christiani

    Abstract: Weighted minwise hashing is a standard dimensionality reduction technique with applications to similarity search and large-scale kernel machines. We introduce a simple algorithm that takes a weighted set $x \in \mathbb{R}_{\geq 0}^{d}$ and computes $k$ independent minhashes in expected time $O(k \log k + \Vert x \Vert_{0}\log( \Vert x \Vert_1 + 1/\Vert x \Vert_1))$, improving upon the state-of-the…

    Submitted 23 May, 2020; originally announced May 2020.

    Comments: See https://github.com/tobc/dartminhash for the code accompanying the experiments

  2. arXiv:1906.12211  [pdf, other]

    cs.DS cs.CG

    PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors

    Authors: Martin Aumüller, Tobias Christiani, Rasmus Pagh, Michael Vesterli

    Abstract: We present PUFFINN, a parameterless LSH-based index for solving the $k$-nearest neighbor problem with probabilistic guarantees. By parameterless we mean that the user is only required to specify the amount of memory the index is supposed to use and the result quality that should be achieved. The index combines several heuristic ideas known in the literature. By small adaptions to the query algorit…

    Submitted 28 June, 2019; originally announced June 2019.

    Comments: Extended version of the ESA 2019 paper

  3. arXiv:1906.09430  [pdf, other]

    cs.DS

    Algorithms for Similarity Search and Pseudorandomness

    Authors: Tobias Christiani

    Abstract: We study the problem of approximate near neighbor (ANN) search and show the following results: - An improved framework for solving the ANN problem using locality-sensitive hashing, reducing the number of evaluations of locality-sensitive hash functions and the word-RAM complexity compared to the standard framework. - A framework for solving the ANN problem with space-time tradeoffs as well as…

    Submitted 22 June, 2019; originally announced June 2019.

    Comments: PhD thesis

  4. arXiv:1812.02603  [pdf, ps, other]

    cs.DS

    Confirmation Sampling for Exact Nearest Neighbor Search

    Authors: Tobias Christiani, Rasmus Pagh, Mikkel Thorup

    Abstract: Locality-sensitive hashing (LSH), introduced by Indyk and Motwani in STOC '98, has been an extremely influential framework for nearest neighbor search in high-dimensional data sets. While theoretical work has focused on the approximate nearest neighbor problems, in practice LSH data structures with suitably chosen parameters are used to solve the exact nearest neighbor problem (with some error pro…

    Submitted 6 December, 2018; originally announced December 2018.

  5. arXiv:1812.01557  [pdf, ps, other]

    cs.DM

    Optimal Boolean Locality-Sensitive Hashing

    Authors: Tobias Christiani

    Abstract: For $0 \leq \beta < \alpha < 1$ the distribution $\mathcal{H}$ over Boolean functions $h \colon \{-1, 1\}^d \to \{-1, 1\}$ that minimizes the expression \begin{equation*} \rho_{\alpha, \beta} = \frac{\log(1/\Pr_{\substack{h \sim \mathcal{H} \\ (x, y) \text{ $\alpha$-corr.}}}[h(x) = h(y)])}{\log(1/\Pr_{\substack{h \sim \mathcal{H} \\ (x, y) \text{ $\beta$-corr.}}}[h(x) = h(y)])} \end{equation*} assigns nonzero probability only…

    Submitted 4 December, 2018; originally announced December 2018.

  6. arXiv:1708.07586  [pdf, other]

    cs.DS

    Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search

    Authors: Tobias Christiani

    Abstract: The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a general technique for constructing a data structure to answer approximate near neighbor queries by using a distribution $\mathcal{H}$ over locality-sensitive hash functions that partition space. For a collection of $n$ points, after preprocessing, the query time is dominated by $O(n^\rho \log n)$ evaluations of hash functio…

    Submitted 16 February, 2018; v1 submitted 24 August, 2017; originally announced August 2017.

    Comments: 15 pages, 3 figures

    ACM Class: E.1; H.3.3

  7. arXiv:1707.06814  [pdf, other]

    cs.DB cs.DS

    Scalable and robust set similarity join

    Authors: Tobias Christiani, Rasmus Pagh, Johan Sivertsen

    Abstract: Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important --- indeed, where the exact set similarity join is itself only an ap…

    Submitted 2 March, 2018; v1 submitted 21 July, 2017; originally announced July 2017.

  8. arXiv:1703.07867  [pdf, other]

    cs.DS

    Distance-Sensitive hashing

    Authors: Martin Aumüller, Tobias Christiani, Rasmus Pagh, Francesco Silvestri

    Abstract: Locality-sensitive hashing (LSH) is an important tool for managing high-dimensional noisy or uncertain data, for example in connection with data cleaning (similarity join) and noise-robust search (similarity search). However, for a number of problems the LSH framework is not known to yield good solutions, and instead ad hoc solutions have been designed for particular similarity and distance measur…

    Submitted 17 April, 2018; v1 submitted 22 March, 2017; originally announced March 2017.

    Comments: Accepted at PODS'18. Abstract shortened due to character limit

    ACM Class: H.3.3

  9. arXiv:1612.07710  [pdf, other]

    cs.DS

    Set Similarity Search Beyond MinHash

    Authors: Tobias Christiani, Rasmus Pagh

    Abstract: We consider the problem of approximate set similarity search under Braun-Blanquet similarity $B(\mathbf{x}, \mathbf{y}) = |\mathbf{x} \cap \mathbf{y}| / \max(|\mathbf{x}|, |\mathbf{y}|)$. The $(b_1, b_2)$-approximate Braun-Blanquet similarity search problem is to preprocess a collection of sets $P$ such that, given a query set $\mathbf{q}$, if there exists $\mathbf{x} \in P$ with…

    Submitted 18 April, 2017; v1 submitted 22 December, 2016; originally announced December 2016.

    Comments: The first arXiv version of this paper introduced an upper bound for Jaccard similarity search that was based on a miscalculation which led the authors to believe that the "hardest instances" for Jaccard similarity search using Chosen Path occur when all sets have the same size. The question of which existing technique is better depends on set sizes and similarity thresholds (details in paper)

  10. arXiv:1605.02687  [pdf, ps, other]

    cs.DS

    A Framework for Similarity Search with Space-Time Tradeoffs using Locality-Sensitive Filtering

    Authors: Tobias Christiani

    Abstract: We present a framework for similarity search based on Locality-Sensitive Filtering (LSF), generalizing the Indyk-Motwani (STOC 1998) Locality-Sensitive Hashing (LSH) framework to support space-time tradeoffs. Given a family of filters, defined as a distribution over pairs of subsets of space with certain locality-sensitivity properties, we can solve the approximate near neighbor problem in $d$-dim…

    Submitted 22 November, 2016; v1 submitted 9 May, 2016; originally announced May 2016.

    Comments: Accepted to SODA'17. See the paper for the complete abstract

  11. arXiv:1506.03676  [pdf, ps, other]

    cs.DS

    From Independence to Expansion and Back Again

    Authors: Tobias Christiani, Rasmus Pagh, Mikkel Thorup

    Abstract: We consider the following fundamental problems: (1) Constructing $k$-independent hash functions with a space-time tradeoff close to Siegel's lower bound. (2) Constructing representations of unbalanced expander graphs having small size and allowing fast computation of the neighbor function. It is not hard to show that these problems are intimately connected in the sense that a good solution to one…

    Submitted 11 June, 2015; originally announced June 2015.

    Comments: An extended abstract of this paper was accepted to The 47th ACM Symposium on Theory of Computing (STOC 2015). Copyright ACM

    ACM Class: E.1; E.2; G.2.2; F.2.2; G.3

  12. arXiv:1408.2157  [pdf, ps, other]

    cs.DS

    Generating k-independent variables in constant time

    Authors: Tobias Christiani, Rasmus Pagh

    Abstract: The generation of pseudorandom elements over finite fields is fundamental to the time, space and randomness complexity of randomized algorithms and data structures. We consider the problem of generating $k$-independent random values over a finite field $\mathbb{F}$ in a word RAM model equipped with constant time addition and multiplication in $\mathbb{F}$, and present the first nontrivial construc…

    Submitted 9 August, 2014; originally announced August 2014.

    Comments: Accepted to The 55th Annual Symposium on Foundations of Computer Science (FOCS 2014). Copyright IEEE