VIBE: Vector Index Benchmark for Embeddings

Authors: Elias Jääsaari, Ville Hyvönen, Matteo Ceccarello, Teemu Roos, Martin Aumüller

Abstract: Approximate nearest neighbor (ANN) search is a performance-critical component of many machine learning pipelines. Rigorous benchmarking is essential for evaluating the performance of vector indexes for ANN search. However, the datasets of the existing benchmarks are no longer representative of the current applications of ANN search. Hence, there is an urgent need for an up-to-date set of benchmarks. To this end, we introduce Vector Index Benchmark for Embeddings (VIBE), an open source project for benchmarking ANN algorithms. VIBE contains a pipeline for creating benchmark datasets using dense embedding models characteristic of modern applications, such as retrieval-augmented generation (RAG). To replicate real-world workloads, we also include out-of-distribution (OOD) datasets where the queries and the corpus are drawn from different distributions. We use VIBE to conduct a comprehensive evaluation of SOTA vector indexes, benchmarking 21 implementations on 12 in-distribution and 6 out-of-distribution datasets.

Submitted 23 May, 2025; originally announced May 2025.

Comments: 25 pages

arXiv:2409.17424 [pdf, other]

Results of the Big ANN: NeurIPS'23 competition

Authors: Harsha Vardhan Simhadri, Martin Aumüller, Amir Ingber, Matthijs Douze, George Williams, Magdalen Dobson Manohar, Dmitry Baranchuk, Edo Liberty, Frank Liu, Ben Landrum, Mazin Karjikar, Laxman Dhulipala, Meng Chen, Yue Chen, Rui Ma, Kai Zhang, Yuzheng Cai, Jiayang Shi, Yizhuo Chen, Weiguo Zheng, Zihao Wan, Jie Yin, Ben Huang

Abstract: The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state-of-the-art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search ~\cite{DBLP:conf/nips/SimhadriWADBBCH21}, this competition addressed filtered search, out-of-distribution data, sparse and streaming variants of ANNS. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources. The results showcased significant improvements in search accuracy and efficiency over industry-standard baselines, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into the current advancements and future directions in the field of approximate nearest neighbor search.

Submitted 25 September, 2024; originally announced September 2024.

Comments: Code: https://github.com/harsha-simhadri/big-ann-benchmarks/releases/tag/v0.3.0

ACM Class: H.3.3

arXiv:2409.07187 [pdf, other]

Differentially Private High-Dimensional Approximate Range Counting, Revisited

Authors: Martin Aumüller, Fabrizio Boninsegna, Francesco Silvestri

Abstract: Locality Sensitive Filters are known for offering a quasi-linear space data structure with rigorous guarantees for the Approximate Near Neighbor search (ANN) problem. Building on Locality Sensitive Filters, we derive a simple data structure for the Approximate Near Neighbor Counting (ANNC) problem under differential privacy (DP). Moreover, we provide a simple analysis leveraging a connection with concomitant statistics and extreme value theory. Our approach produces a simple data structure with a tunable parameter that regulates a trade-off between space-time and utility. Through this trade-off, our data structure achieves the same performance as the recent findings of Andoni et al. (NeurIPS 2023) while offering better utility at the cost of higher space and query time. In addition, we provide a more efficient algorithm under pure $\varepsilon$-DP and elucidate the connection between ANN and differentially private ANNC. As a side result, the paper provides a more compact description and analysis of Locality Sensitive Filters for Fair Near Neighbor Search, improving a previous result in Aumüller et al. (TODS 2022).

Submitted 2 May, 2025; v1 submitted 11 September, 2024; originally announced September 2024.

arXiv:2306.08745 [pdf, other]

PLAN: Variance-Aware Private Mean Estimation

Authors: Martin Aumüller, Christian Janos Lebeda, Boel Nelson, Rasmus Pagh

Abstract: Differentially private mean estimation is an important building block in privacy-preserving algorithms for data analysis and machine learning. Though the trade-off between privacy and utility is well understood in the worst case, many datasets exhibit structure that could potentially be exploited to yield better algorithms. In this paper we present $\textit{Private Limit Adapted Noise}$ (PLAN), a family of differentially private algorithms for mean estimation in the setting where inputs are independently sampled from a distribution $\mathcal{D}$ over $\mathbf{R}^d$, with coordinate-wise standard deviations $\boldsymbol{\sigma} \in \mathbf{R}^d$. Similar to mean estimation under Mahalanobis distance, PLAN tailors the shape of the noise to the shape of the data, but unlike previous algorithms the privacy budget is spent non-uniformly over the coordinates. Under a concentration assumption on $\mathcal{D}$, we show how to exploit skew in the vector $\boldsymbol{\sigma}$, obtaining a (zero-concentrated) differentially private mean estimate with $\ell_2$ error proportional to $\|\boldsymbol{\sigma}\|_1$. Previous work has either not taken $\boldsymbol{\sigma}$ into account, or measured error in Mahalanobis distance, in both cases resulting in $\ell_2$ error proportional to $\sqrt{d}\|\boldsymbol{\sigma}\|_2$, which can be up to a factor $\sqrt{d}$ larger. To verify the effectiveness of PLAN, we empirically evaluate accuracy on both synthetic and real world data.

Submitted 10 April, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

arXiv:2205.03763 [pdf, other]

Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search

Authors: Harsha Vardhan Simhadri, George Williams, Martin Aumüller, Matthijs Douze, Artem Babenko, Dmitry Baranchuk, Qi Chen, Lucas Hosseini, Ravishankar Krishnaswamy, Gopal Srinivasa, Suhas Jayaram Subramanya, Jingdong Wang

Abstract: Despite the broad range of algorithms for Approximate Nearest Neighbor Search, most empirical evaluations of algorithms have focused on smaller datasets, typically of 1 million points~\citep{Benchmark}. However, deploying recent advances in embedding-based techniques for search, recommendation and ranking at scale requires ANNS indices at billion, trillion or larger scale. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale vis-à-vis their hardware cost. This competition compares ANNS algorithms at billion-scale by hardware cost, accuracy and performance. We set up an open source evaluation framework and leaderboards for both standardized and specialized hardware. The competition involves three tracks. The standard hardware track T1 evaluates algorithms on an Azure VM with limited DRAM, often the bottleneck in serving billion-scale indices, where the embedding data can be hundreds of gigabytes in size. It uses FAISS~\citep{Faiss17} as the baseline. The standard hardware track T2 additionally allows inexpensive SSDs alongside the limited DRAM and uses DiskANN~\citep{DiskANN19} as the baseline. The specialized hardware track T3 allows any hardware configuration, and again uses FAISS as the baseline. We compiled six diverse billion-scale datasets, four newly released for this competition, that span a variety of modalities, data types, dimensions, deep learning models, distance functions and sources. The outcome of the competition was ranked leaderboards of algorithms in each track based on recall at a query throughput threshold. Additionally, for track T3, separate leaderboards were created based on recall as well as cost-normalized and power-normalized query throughput.

Submitted 7 May, 2022; originally announced May 2022.

arXiv:2107.02736 [pdf, other]

DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest Neighbor Search

Authors: Matti Karppa, Martin Aumüller, Rasmus Pagh

Abstract: Kernel Density Estimation (KDE) is a nonparametric method for estimating the shape of a density function, given a set of samples from the distribution. Recently, locality-sensitive hashing, originally proposed as a tool for nearest neighbor search, has been shown to enable fast KDE data structures. However, these approaches do not take advantage of the many other advances that have been made in algorithms for nearest neighbor search. We present an algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN) where we apply Approximate Nearest Neighbor (ANN) algorithms as a black box subroutine to compute an unbiased KDE. The idea is to find points that have a large contribution to the KDE using ANN, compute their contribution exactly, and approximate the remainder with Random Sampling (RS). We present a theoretical argument that supports the idea that an ANN subroutine can speed up the evaluation. Furthermore, we provide a C++ implementation with a Python interface that can make use of an arbitrary ANN implementation as a subroutine for kernel density estimation. We show empirically that our implementation outperforms state of the art implementations in all high dimensional datasets we considered, and matches the performance of RS in cases where the ANN subroutine yields no gains in performance.

Submitted 1 March, 2022; v1 submitted 6 July, 2021; originally announced July 2021.

Comments: 35 pages, 1 figure. AISTATS 2022
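
The idea summarized in the DEANN abstract (exact contribution from the nearest points, random sampling for the remainder) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the Gaussian kernel, and the brute-force neighbor search standing in for an ANN index are all assumptions made for this example.

```python
import math
import random


def gaussian_kernel(q, x, bandwidth):
    """Gaussian kernel value between query q and data point x (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(q, x))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def exact_kde(q, data, bandwidth):
    """Reference value: average kernel over all data points."""
    return sum(gaussian_kernel(q, x, bandwidth) for x in data) / len(data)


def nn_plus_sampling_kde(q, data, bandwidth, k, m, rng):
    """Estimate the KDE at q from k near neighbors plus m random samples.

    The k nearest points are found by brute force here; a real system would
    use an ANN index instead (hypothetical stand-in). Requires m >= 1.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    n = len(data)
    order = sorted(
        range(n),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(q, data[i])),
    )
    near, rest = order[:k], order[k:]
    # Contribution of the near neighbors is computed exactly.
    near_sum = sum(gaussian_kernel(q, data[i], bandwidth) for i in near)
    if not rest:
        return near_sum / n
    # The remainder is estimated by uniform sampling with replacement,
    # scaled by the number of remaining points.
    sample = [rng.choice(rest) for _ in range(m)]
    mean_rest = sum(gaussian_kernel(q, data[i], bandwidth) for i in sample) / m
    return (near_sum + len(rest) * mean_rest) / n
```

Because the sample mean over the remaining points has the same expectation as their true average, the combined estimate stays unbiased for any fixed choice of the near set; a larger `k` trades more exact work for lower variance.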

arXiv:2106.10068 [pdf, other]

Differentially Private Sparse Vectors with Low Error, Optimal Space, and Fast Access

Authors: Martin Aumüller, Christian Janos Lebeda, Rasmus Pagh

Abstract: Representing a sparse histogram, or more generally a sparse vector, is a fundamental task in differential privacy. An ideal solution would use space close to information-theoretical lower bounds, have an error distribution that depends optimally on the desired privacy level, and allow fast random access to entries in the vector. However, existing approaches have only achieved two of these three goals. In this paper we introduce the Approximate Laplace Projection (ALP) mechanism for approximating k-sparse vectors. This mechanism is shown to simultaneously have information-theoretically optimal space (up to constant factors), fast access to vector entries, and error of the same magnitude as the Laplace-mechanism applied to dense vectors. A key new technique is a unary representation of small integers, which we show to be robust against "randomized response" noise. This representation is combined with hashing, in the spirit of Bloom filters, to obtain a space-efficient, differentially private representation. Our theoretical performance bounds are complemented by simulations which show that the constant factors on the main performance parameters are quite small, suggesting practicality of the technique.

Submitted 27 September, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

arXiv:2101.10905 [pdf, other]

Sampling a Near Neighbor in High Dimensions -- Who is the Fairest of Them All?

Authors: Martin Aumüller, Sariel Har-Peled, Sepideh Mahabadi, Rasmus Pagh, Francesco Silvestri

Abstract: Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points $S$ and a radius parameter $r>0$, the $r$-near neighbor ($r$-NN) problem asks for a data structure that, given any query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance $r$ from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. In this work, we show that LSH based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights the inherent unfairness of NN data structures and shows the performance of our algorithms on real-world datasets.

Submitted 26 January, 2021; originally announced January 2021.

Comments: arXiv admin note: text overlap with arXiv:1906.02640

arXiv:2008.08134 [pdf, other]

Differentially Private Sketches for Jaccard Similarity Estimation

Authors: Martin Aumüller, Anders Bourgeat, Jana Schmurr

Abstract: This paper describes two locally differentially private algorithms for releasing user vectors such that the Jaccard similarity between these vectors can be efficiently estimated. The basic building block is the well-known MinHash method. To achieve a privacy-utility trade-off, MinHash is extended in two ways using variants of Generalized Randomized Response and the Laplace Mechanism. A theoretical analysis provides bounds on the absolute error and experiments show the utility-privacy trade-off on synthetic and real-world data. The paper ends with a critical discussion of related work.

Submitted 18 August, 2020; originally announced August 2020.

Comments: Accepted at SISAP 2020
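
The MinHash building block mentioned in this abstract can be illustrated with a short sketch. It shows only plain, non-private MinHash; the privacy mechanisms of the paper (Generalized Randomized Response, Laplace Mechanism) are not included. The function names, the linear hash family modulo a Mersenne prime, and the restriction to non-empty sets of non-negative integers below that prime are assumptions made for this example.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus (assumed choice)


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, PRIME), rng.randrange(0, PRIME))
        for _ in range(num_hashes)
    ]


def minhash_signature(items, params):
    """Signature of a non-empty set of integers: the minimum hash per function."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)
```

Two sets agree on a signature position when the same element attains the minimum in both, which is what ties the fraction of agreeing positions to the Jaccard similarity. More hash functions reduce the variance of the estimate.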

arXiv:1907.07387 [pdf, other]

The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search

Authors: Martin Aumüller, Matteo Ceccarello

Abstract: This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) makes it possible to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of different LID distributions on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that give a more fine-grained overview of the inner workings of nearest neighbor search principles. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.

Submitted 17 July, 2019; originally announced July 2019.

Comments: Preprint of the paper accepted at SISAP 2019

arXiv:1906.12211 [pdf, other]

PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors

Authors: Martin Aumüller, Tobias Christiani, Rasmus Pagh, Michael Vesterli

Abstract: We present PUFFINN, a parameterless LSH-based index for solving the $k$-nearest neighbor problem with probabilistic guarantees. By parameterless we mean that the user is only required to specify the amount of memory the index is supposed to use and the result quality that should be achieved. The index combines several heuristic ideas known in the literature. By small adaptations to the query algorithm, we make these heuristics rigorous. We perform experiments on real-world and synthetic inputs to evaluate implementation choices and show that the implementation satisfies the quality guarantees while being competitive with other state-of-the-art approaches to nearest neighbor search. We describe a novel synthetic data set that is difficult to solve for almost all existing nearest neighbor search approaches, and for which PUFFINN significantly outperforms previous methods.

Submitted 28 June, 2019; originally announced June 2019.

Comments: Extended version of the ESA 2019 paper

arXiv:1906.01859 [pdf, other]

doi 10.1145/3375395.3387648

Fair Near Neighbor Search: Independent Range Sampling in High Dimensions

Authors: Martin Aumüller, Rasmus Pagh, Francesco Silvestri

Abstract: Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the $r$-near neighbor ($r$-NN) problem: given a radius $r>0$ and a set of points $S$, construct a data structure that, for any given query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance $r$ from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for $r$-NN where all points in $S$ that are near $q$ have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem.

Submitted 15 June, 2020; v1 submitted 5 June, 2019; originally announced June 2019.

Comments: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS), Pages 191-204, June 2020

arXiv:1810.12047 [pdf, ps, other]

Simple and Fast BlockQuicksort using Lomuto's Partitioning Scheme

Authors: Martin Aumüller, Nikolaj Hass

Abstract: This paper presents simple variants of the BlockQuicksort algorithm described by Edelkamp and Weiss (ESA 2016). The simplification is achieved by using Lomuto's partitioning scheme instead of Hoare's crossing pointer technique to partition the input. To achieve a robust sorting algorithm that works well on many different input types, the paper introduces a novel two-pivot variant of Lomuto's partitioning scheme. A surprisingly simple twist to the generic two-pivot quicksort approach makes the algorithm robust. The paper provides an analysis of the theoretical properties of the proposed algorithms and compares them to their competitors. The analysis shows that Lomuto-based approaches incur a higher average sorting cost than the Hoare-based approach of BlockQuicksort. Moreover, the analysis is particularly useful to reason about pivot choices that suit the two-pivot approach. An extensive experimental study shows that, despite their worse theoretical behavior, the simpler variants perform as well as the original version of BlockQuicksort.

Submitted 29 October, 2018; originally announced October 2018.

Comments: Accepted at ALENEX 2019

ACM Class: F.2.2
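
For readers unfamiliar with Lomuto's partitioning scheme named in this abstract, the classic single-pivot form is sketched below. This is the textbook scheme only, not the blocked or two-pivot variants the paper proposes; the function names and the choice of the last element as pivot are assumptions made for this example.

```python
def lomuto_partition(a, lo, hi):
    """Partition a[lo..hi] in place around the pivot a[hi].

    A single left-to-right scan keeps all elements smaller than the pivot
    in a[lo..i-1]. Returns the final index of the pivot.
    """
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    # Move the pivot between the "smaller" and "not smaller" parts.
    a[i], a[hi] = a[hi], a[i]
    return i


def quicksort(a, lo=0, hi=None):
    """Sort list a in place using Lomuto partitioning."""
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = lomuto_partition(a, lo, hi)
        quicksort(a, lo, p - 1)
        quicksort(a, p + 1, hi)
```

Unlike Hoare's crossing-pointer technique, the scan moves in one direction only, which is what makes the scheme simple to implement.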

arXiv:1807.05614 [pdf, other]

ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms

Authors: Martin Aumüller, Erik Bernhardsson, Alexander Faithfull

Abstract: This paper describes ANN-Benchmarks, a tool for evaluating the performance of in-memory approximate nearest neighbor algorithms. It provides a standard interface for measuring the performance and quality achieved by nearest neighbor algorithms on different standard data sets. It supports several different ways of integrating $k$-NN algorithms, and its configuration system automatically tests a range of parameter settings for each algorithm. Algorithms are compared with respect to many different (approximate) quality measures, and adding more is easy and fast; the included plotting front-ends can visualise these as images, $\LaTeX$ plots, and websites with interactive plots. ANN-Benchmarks aims to provide a constantly updated overview of the current state of the art of $k$-NN algorithms. In the short term, this overview allows users to choose the correct $k$-NN algorithm and parameters for their similarity search task; in the longer term, algorithm designers will be able to use this overview to test and refine automatic parameter tuning. The paper gives an overview of the system, evaluates the results of the benchmark, and points out directions for future work. Interestingly, very different approaches to $k$-NN search yield comparable quality-performance trade-offs. The system is available at http://ann-benchmarks.com .

Submitted 17 July, 2018; v1 submitted 15 July, 2018; originally announced July 2018.

Comments: Full version of the SISAP 2017 conference paper. v2: Updated the abstract to avoid arXiv linking to the wrong URL

ACM Class: H.3.3

arXiv:1703.07867 [pdf, other]

Distance-Sensitive hashing

Authors: Martin Aumüller, Tobias Christiani, Rasmus Pagh, Francesco Silvestri

Abstract: Locality-sensitive hashing (LSH) is an important tool for managing high-dimensional noisy or uncertain data, for example in connection with data cleaning (similarity join) and noise-robust search (similarity search). However, for a number of problems the LSH framework is not known to yield good solutions, and instead ad hoc solutions have been designed for particular similarity and distance measures. For example, this is true for output-sensitive similarity search/join, and for indexes supporting annulus queries that aim to report a point close to a certain given distance from the query point. In this paper we initiate the study of distance-sensitive hashing (DSH), a generalization of LSH that seeks a family of hash functions such that the probability of two points having the same hash value is a given function of the distance between them. More precisely, given a distance space $(X, \text{dist})$ and a "collision probability function" (CPF) $f\colon \mathbb{R}\rightarrow [0,1]$ we seek a distribution over pairs of functions $(h,g)$ such that for every pair of points $x, y \in X$ the collision probability is $\Pr[h(x)=g(y)] = f(\text{dist}(x,y))$. Locality-sensitive hashing is the study of how fast a CPF can decrease as the distance grows. For many spaces, $f$ can be made exponentially decreasing even if we restrict attention to the symmetric case where $g=h$. We show that the asymmetry achieved by having a pair of functions makes it possible to achieve CPFs that are, for example, increasing or unimodal, and show how this leads to principled solutions to problems not addressed by the LSH framework. This includes a novel application to privacy-preserving distance estimation. We believe that the DSH framework will find further applications in high-dimensional data management.

Submitted 17 April, 2018; v1 submitted 22 March, 2017; originally announced March 2017.

Comments: Accepted at PODS'18. Abstract shortened due to character limit

ACM Class: H.3.3

arXiv:1611.00258 [pdf, other]

doi 10.1017/S096354831800041X

Dual-Pivot Quicksort: Optimality, Analysis and Zeros of Associated Lattice Paths

Authors: Martin Aumüller, Martin Dietzfelbinger, Clemens Heuberger, Daniel Krenn, Helmut Prodinger

Abstract: We present an average case analysis of a variant of dual-pivot quicksort. We show that the algorithmic partitioning strategy used is optimal, i.e., it minimizes the expected number of key comparisons. For the analysis, we calculate the expected number of comparisons exactly as well as asymptotically; in particular, we provide exact expressions for the linear, logarithmic, and constant terms. An essential step is the analysis of zeros of lattice paths in a certain probability model. Along the way a combinatorial identity is proven.

Submitted 27 November, 2017; v1 submitted 1 November, 2016; originally announced November 2016.

Comments: This article supersedes arXiv:1602.04031

MSC Class: 05A16; 68R05; 68P10; 68Q25; 68W40

Journal ref: Combin. Probab. Comput. 28 (2019), no. 4, 485-518

arXiv:1611.00029 [pdf, other]

A Simple Hash Class with Strong Randomness Properties in Graphs and Hypergraphs

Authors: Martin Aumüller, Martin Dietzfelbinger, Philipp Woelfel

Abstract: We study randomness properties of graphs and hypergraphs generated by simple hash functions. Several hashing applications can be analyzed by studying the structure of $d$-uniform random ($d$-partite) hypergraphs obtained from a set $S$ of $n$ keys and $d$ randomly chosen hash functions $h_1,\dots,h_d$ by associating each key $x\in S$ with a hyperedge $\{h_1(x),\dots, h_d(x)\}$. Often it is assumed that $h_1,\dots,h_d$ exhibit a high degree of independence. We present a simple construction of a hash class whose hash functions have small constant evaluation time and can be stored in sublinear space. We devise general techniques to analyze the randomness properties of the graphs and hypergraphs generated by these hash functions, and we show that they can replace other, less efficient constructions in cuckoo hashing (with and without stash), the simulation of a uniform hash function, the construction of a perfect hash function, generalized cuckoo hashing and different load balancing scenarios.

Submitted 31 October, 2016; originally announced November 2016.

MSC Class: 68P05; 68R10; 68W20; 05C80

arXiv:1605.02673 [pdf, other]

doi 10.1137/1.9781611974782.16

Parameter-free Locality Sensitive Hashing for Spherical Range Reporting

Authors: Thomas D. Ahle, Martin Aumüller, Rasmus Pagh

Abstract: We present a data structure for *spherical range reporting* on a point set $S$, i.e., reporting all points in $S$ that lie within radius $r$ of a given query point $q$. Our solution builds upon the Locality-Sensitive Hashing (LSH) framework of Indyk and Motwani, which represents the asymptotically best solutions to near neighbor problems in high dimensions. While traditional LSH data structures have several parameters whose optimal values depend on the distance distribution from $q$ to the points of $S$, our data structure is parameter-free, except for the space usage, which is configurable by the user. Nevertheless, its expected query time basically matches that of an LSH data structure whose parameters have been *optimally chosen for the data and query* in question under the given space constraints. In particular, our data structure provides a smooth trade-off between hard queries (typically addressed by standard LSH) and easy queries such as those where the number of points to report is a constant fraction of $S$, or where almost all points in $S$ are far away from the query point. In contrast, known data structures fix LSH parameters based on certain parameters of the input alone. The algorithm has expected query time bounded by $O(t (n/t)^ρ)$, where $t$ is the number of points to report and $ρ\in (0,1)$ depends on the data distribution and the strength of the LSH family used.
We further present a parameter-free way of using multi-probing, for LSH families that support it, and show that for many such families this approach allows us to get expected query time close to $O(n^ρ+t)$, which is the best we can hope to achieve using LSH. The previously best running time in high dimensions was $Ω(t n^ρ)$. For many data distributions where the intrinsic dimensionality of the point set close to $q$ is low, we can give improved upper bounds on the expected query time.

Submitted 20 July, 2016; v1 submitted 9 May, 2016; originally announced May 2016.

Comments: 21 pages, 5 figures, due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract appearing here is slightly shorter than that in the PDF file

ACM Class: H.3.3

arXiv:1604.02093 [pdf]

doi 10.1016/j.molliq.2016.08.103

Impact of water on the charge transport of a glass-forming ionic liquid

Authors: P. Sippel, V. Dietrich, D. Reuter, M. Aumüller, P. Lunkenheimer, A. Loidl, S. Krohns

Abstract: Using dielectric spectroscopy and differential scanning calorimetry, we have performed a detailed investigation of the influence of water uptake on the translational and reorientational glassy dynamics in the typical ionic liquid 1-Butyl-3-methyl-imidazolium chloride. From a careful analysis of the measured dielectric permittivity and conductivity spectra, we find a significant acceleration of cation reorientation and a marked increase of the ionic conductivity for higher water contents. The latter effect mainly arises due to a strong impact of water content on the glass temperature, which for the well-dried material is found to be larger than any values reported in literature for this system. The fragility, characterizing the non-Arrhenius glassy dynamics of the ionic subsystem, also changes with varying water content. Decoupling of the ionic motion from the structural dynamics has to be considered to explain the results.

Submitted 7 April, 2016; originally announced April 2016.

Comments: 10 pages, 7 figures

Journal ref: J. Mol. Liq. 223 (2016) 635

arXiv:1602.04031 [pdf, other]

Counting Zeros in Random Walks on the Integers and Analysis of Optimal Dual-Pivot Quicksort

Authors: Martin Aumüller, Martin Dietzfelbinger, Clemens Heuberger, Daniel Krenn, Helmut Prodinger

Abstract: We present an average case analysis of two variants of dual-pivot quicksort, one with a non-algorithmic comparison-optimal partitioning strategy, the other with a closely related algorithmic strategy. For both we calculate the expected number of comparisons exactly as well as asymptotically, in particular, we provide exact expressions for the linear, logarithmic, and constant terms. An essential step is the analysis of zeros of lattice paths in a certain probability model. Along the way a combinatorial identity is proven.

Submitted 11 May, 2016; v1 submitted 12 February, 2016; originally announced February 2016.

Comments: extended abstract

MSC Class: 05A16; 68R05; 68P10; 68Q25; 68W40

arXiv:1510.04676 [pdf, ps, other]

How Good is Multi-Pivot Quicksort?

Authors: Martin Aumüller, Martin Dietzfelbinger, Pascal Klaue

Abstract: Multi-Pivot Quicksort refers to variants of classical quicksort where in the partitioning step $k$ pivots are used to split the input into $k + 1$ segments. For many years, multi-pivot quicksort was regarded as impractical, but in 2009 a 2-pivot approach by Yaroslavskiy, Bentley, and Bloch was chosen as the standard sorting algorithm in Sun's Java 7. In 2014 at ALENEX, Kushagra et al. introduced an even faster algorithm that uses three pivots. This paper studies what possible advantages multi-pivot quicksort might offer in general. The contributions are as follows: Natural comparison-optimal algorithms for multi-pivot quicksort are devised and analyzed. The analysis shows that the benefits of using multiple pivots with respect to the average comparison count are marginal and these strategies are inferior to simpler strategies such as the well known median-of-$k$ approach. A substantial part of the partitioning cost is caused by rearranging elements. A rigorous analysis of an algorithm for rearranging elements in the partitioning step is carried out, observing mainly how often array cells are accessed during partitioning. The algorithm behaves best if 3 to 5 pivots are used. Experiments show that this translates into good cache behavior and is closest to predicting observed running times of multi-pivot quicksort algorithms. Finally, it is studied how choosing pivots from a sample affects sorting cost.
The study is theoretical in the sense that although the findings motivate design recommendations for multipivot quicksort algorithms that lead to running time improvements over known algorithms in an experimental setting, these improvements are small.

Submitted 31 May, 2016; v1 submitted 15 October, 2015; originally announced October 2015.

Comments: Submitted to a journal, v2: Fixed statement of Gibbs' inequality, v3: Revised version, especially improving on the experiments in Section 9

ACM Class: F.2.2

arXiv:1303.5217 [pdf, other]

Optimal Partitioning for Dual-Pivot Quicksort

Authors: Martin Aumüller, Martin Dietzfelbinger

Abstract: Dual-pivot quicksort refers to variants of classical quicksort where in the partitioning step two pivots are used to split the input into three segments. This can be done in different ways, giving rise to different algorithms. Recently, a dual-pivot algorithm proposed by Yaroslavskiy received much attention, because a variant of it replaced the well-engineered quicksort algorithm in Sun's Java 7 runtime library. Nebel and Wild (ESA 2012) analyzed this algorithm and showed that on average it uses 1.9n ln n + O(n) comparisons to sort an input of size n, beating standard quicksort, which uses 2n ln n + O(n) comparisons. We introduce a model that captures all dual-pivot algorithms, give a unified analysis, and identify new dual-pivot algorithms that minimize the average number of key comparisons among all possible algorithms up to a linear term. This minimum is 1.8n ln n + O(n). For the case that the pivots are chosen from a small sample, we include a comparison of dual-pivot quicksort and classical quicksort. Specifically, we show that dual-pivot quicksort benefits from a skewed choice of pivots. We experimentally evaluate our algorithms and compare them to Yaroslavskiy's algorithm and the recently described three-pivot quicksort algorithm of Kushagra et al. (ALENEX 2014).

Submitted 13 October, 2015; v1 submitted 21 March, 2013; originally announced March 2013.

Comments: Accepted for publication in ACM Transactions on Algorithms

arXiv:1204.4431 [pdf, ps, other]

Explicit and Efficient Hash Families Suffice for Cuckoo Hashing with a Stash

Authors: Martin Aumüller, Martin Dietzfelbinger, Philipp Woelfel

Abstract: It is shown that for cuckoo hashing with a stash as proposed by Kirsch, Mitzenmacher, and Wieder (2008) families of very simple hash functions can be used, maintaining the favorable performance guarantees: with stash size $s$ the probability of a rehash is $O(1/n^{s+1})$, and the evaluation time is $O(s)$. Instead of the full randomness needed for the analysis of Kirsch et al. and of Kutzelnigg (2010) (resp. $Θ(\log n)$-wise independence for standard cuckoo hashing) the new approach even works with 2-wise independent hash families as building blocks. Both construction and analysis build upon the work of Dietzfelbinger and Woelfel (2003). The analysis, which can also be applied to the fully random case, utilizes a graph counting argument and is much simpler than previous proofs. As a byproduct, an algorithm for simulating uniform hashing is obtained. While it requires about twice as much space as the most space efficient solutions, it is attractive because of its simple and direct structure.

Submitted 19 April, 2012; originally announced April 2012.

Comments: 18 Pages

ACM Class: F.2.2

Showing 1–23 of 23 results for author: Aumüller, M