-
VIBE: Vector Index Benchmark for Embeddings
Authors:
Elias Jääsaari,
Ville Hyvönen,
Matteo Ceccarello,
Teemu Roos,
Martin Aumüller
Abstract:
Approximate nearest neighbor (ANN) search is a performance-critical component of many machine learning pipelines. Rigorous benchmarking is essential for evaluating the performance of vector indexes for ANN search. However, the datasets of the existing benchmarks are no longer representative of the current applications of ANN search. Hence, there is an urgent need for an up-to-date set of benchmark…
▽ More
Approximate nearest neighbor (ANN) search is a performance-critical component of many machine learning pipelines. Rigorous benchmarking is essential for evaluating the performance of vector indexes for ANN search. However, the datasets of the existing benchmarks are no longer representative of the current applications of ANN search. Hence, there is an urgent need for an up-to-date set of benchmarks. To this end, we introduce Vector Index Benchmark for Embeddings (VIBE), an open source project for benchmarking ANN algorithms. VIBE contains a pipeline for creating benchmark datasets using dense embedding models characteristic of modern applications, such as retrieval-augmented generation (RAG). To replicate real-world workloads, we also include out-of-distribution (OOD) datasets where the queries and the corpus are drawn from different distributions. We use VIBE to conduct a comprehensive evaluation of SOTA vector indexes, benchmarking 21 implementations on 12 in-distribution and 6 out-of-distribution datasets.
△ Less
Submitted 23 May, 2025;
originally announced May 2025.
-
Results of the Big ANN: NeurIPS'23 competition
Authors:
Harsha Vardhan Simhadri,
Martin Aumüller,
Amir Ingber,
Matthijs Douze,
George Williams,
Magdalen Dobson Manohar,
Dmitry Baranchuk,
Edo Liberty,
Frank Liu,
Ben Landrum,
Mazin Karjikar,
Laxman Dhulipala,
Meng Chen,
Yue Chen,
Rui Ma,
Kai Zhang,
Yuzheng Cai,
Jiayang Shi,
Yizhuo Chen,
Weiguo Zheng,
Zihao Wan,
Jie Yin,
Ben Huang
Abstract:
The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state-of-the-art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search ~\cite{DBLP:conf/nips/SimhadriWADBBCH21}, this competi…
▽ More
The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state-of-the-art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search ~\cite{DBLP:conf/nips/SimhadriWADBBCH21}, this competition addressed filtered search, out-of-distribution data, sparse and streaming variants of ANNS. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources. The results showcased significant improvements in search accuracy and efficiency over industry-standard baselines, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into the current advancements and future directions in the field of approximate nearest neighbor search.
△ Less
Submitted 25 September, 2024;
originally announced September 2024.
-
Differentially Private High-Dimensional Approximate Range Counting, Revisited
Authors:
Martin Aumüller,
Fabrizio Boninsegna,
Francesco Silvestri
Abstract:
Locality Sensitive Filters are known for offering a quasi-linear space data structure with rigorous guarantees for the Approximate Near Neighbor search (ANN) problem. Building on Locality Sensitive Filters, we derive a simple data structure for the Approximate Near Neighbor Counting (ANNC) problem under differential privacy (DP). Moreover, we provide a simple analysis leveraging a connection with…
▽ More
Locality Sensitive Filters are known for offering a quasi-linear space data structure with rigorous guarantees for the Approximate Near Neighbor search (ANN) problem. Building on Locality Sensitive Filters, we derive a simple data structure for the Approximate Near Neighbor Counting (ANNC) problem under differential privacy (DP). Moreover, we provide a simple analysis leveraging a connection with concomitant statistics and extreme value theory. Our approach produces a simple data structure with a tunable parameter that regulates a trade-off between space-time and utility. Through this trade-off, our data structure achieves the same performance as the recent findings of Andoni et al. (NeurIPS 2023) while offering better utility at the cost of higher space and query time. In addition, we provide a more efficient algorithm under pure $\varepsilon$-DP and elucidate the connection between ANN and differentially private ANNC. As a side result, the paper provides a more compact description and analysis of Locality Sensitive Filters for Fair Near Neighbor Search, improving a previous result in Aumüller et al. (TODS 2022).
△ Less
Submitted 2 May, 2025; v1 submitted 11 September, 2024;
originally announced September 2024.
-
PLAN: Variance-Aware Private Mean Estimation
Authors:
Martin Aumüller,
Christian Janos Lebeda,
Boel Nelson,
Rasmus Pagh
Abstract:
Differentially private mean estimation is an important building block in privacy-preserving algorithms for data analysis and machine learning. Though the trade-off between privacy and utility is well understood in the worst case, many datasets exhibit structure that could potentially be exploited to yield better algorithms. In this paper we present $\textit{Private Limit Adapted Noise}$ (PLAN), a…
▽ More
Differentially private mean estimation is an important building block in privacy-preserving algorithms for data analysis and machine learning. Though the trade-off between privacy and utility is well understood in the worst case, many datasets exhibit structure that could potentially be exploited to yield better algorithms. In this paper we present $\textit{Private Limit Adapted Noise}$ (PLAN), a family of differentially private algorithms for mean estimation in the setting where inputs are independently sampled from a distribution $\mathcal{D}$ over $\mathbf{R}^d$, with coordinate-wise standard deviations $\boldsymbolσ \in \mathbf{R}^d$. Similar to mean estimation under Mahalanobis distance, PLAN tailors the shape of the noise to the shape of the data, but unlike previous algorithms the privacy budget is spent non-uniformly over the coordinates. Under a concentration assumption on $\mathcal{D}$, we show how to exploit skew in the vector $\boldsymbolσ$, obtaining a (zero-concentrated) differentially private mean estimate with $\ell_2$ error proportional to $\|\boldsymbolσ\|_1$. Previous work has either not taken $\boldsymbolσ$ into account, or measured error in Mahalanobis distance $\unicode{x2013}$ in both cases resulting in $\ell_2$ error proportional to $\sqrt{d}\|\boldsymbolσ\|_2$, which can be up to a factor $\sqrt{d}$ larger. To verify the effectiveness of PLAN, we empirically evaluate accuracy on both synthetic and real world data.
△ Less
Submitted 10 April, 2024; v1 submitted 14 June, 2023;
originally announced June 2023.
-
Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search
Authors:
Harsha Vardhan Simhadri,
George Williams,
Martin Aumüller,
Matthijs Douze,
Artem Babenko,
Dmitry Baranchuk,
Qi Chen,
Lucas Hosseini,
Ravishankar Krishnaswamy,
Gopal Srinivasa,
Suhas Jayaram Subramanya,
Jingdong Wang
Abstract:
Despite the broad range of algorithms for Approximate Nearest Neighbor Search, most empirical evaluations of algorithms have focused on smaller datasets, typically of 1 million points~\citep{Benchmark}. However, deploying recent advances in embedding based techniques for search, recommendation and ranking at scale require ANNS indices at billion, trillion or larger scale. Barring a few recent pape…
▽ More
Despite the broad range of algorithms for Approximate Nearest Neighbor Search, most empirical evaluations of algorithms have focused on smaller datasets, typically of 1 million points~\citep{Benchmark}. However, deploying recent advances in embedding based techniques for search, recommendation and ranking at scale require ANNS indices at billion, trillion or larger scale. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale vis-à-vis their hardware cost.
This competition compares ANNS algorithms at billion-scale by hardware cost, accuracy and performance. We set up an open source evaluation framework and leaderboards for both standardized and specialized hardware. The competition involves three tracks. The standard hardware track T1 evaluates algorithms on an Azure VM with limited DRAM, often the bottleneck in serving billion-scale indices, where the embedding data can be hundreds of GigaBytes in size. It uses FAISS~\citep{Faiss17} as the baseline. The standard hardware track T2 additional allows inexpensive SSDs in addition to the limited DRAM and uses DiskANN~\citep{DiskANN19} as the baseline. The specialized hardware track T3 allows any hardware configuration, and again uses FAISS as the baseline.
We compiled six diverse billion-scale datasets, four newly released for this competition, that span a variety of modalities, data types, dimensions, deep learning models, distance functions and sources. The outcome of the competition was ranked leaderboards of algorithms in each track based on recall at a query throughput threshold. Additionally, for track T3, separate leaderboards were created based on recall as well as cost-normalized and power-normalized query throughput.
△ Less
Submitted 7 May, 2022;
originally announced May 2022.
-
DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest Neighbor Search
Authors:
Matti Karppa,
Martin Aumüller,
Rasmus Pagh
Abstract:
Kernel Density Estimation (KDE) is a nonparametric method for estimating the shape of a density function, given a set of samples from the distribution. Recently, locality-sensitive hashing, originally proposed as a tool for nearest neighbor search, has been shown to enable fast KDE data structures. However, these approaches do not take advantage of the many other advances that have been made in al…
▽ More
Kernel Density Estimation (KDE) is a nonparametric method for estimating the shape of a density function, given a set of samples from the distribution. Recently, locality-sensitive hashing, originally proposed as a tool for nearest neighbor search, has been shown to enable fast KDE data structures. However, these approaches do not take advantage of the many other advances that have been made in algorithms for nearest neighbor algorithms. We present an algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN) where we apply Approximate Nearest Neighbor (ANN) algorithms as a black box subroutine to compute an unbiased KDE. The idea is to find points that have a large contribution to the KDE using ANN, compute their contribution exactly, and approximate the remainder with Random Sampling (RS). We present a theoretical argument that supports the idea that an ANN subroutine can speed up the evaluation. Furthermore, we provide a C++ implementation with a Python interface that can make use of an arbitrary ANN implementation as a subroutine for kernel density estimation. We show empirically that our implementation outperforms state of the art implementations in all high dimensional datasets we considered, and matches the performance of RS in cases where the ANN yield no gains in performance.
△ Less
Submitted 1 March, 2022; v1 submitted 6 July, 2021;
originally announced July 2021.
-
Differentially Private Sparse Vectors with Low Error, Optimal Space, and Fast Access
Authors:
Martin Aumüller,
Christian Janos Lebeda,
Rasmus Pagh
Abstract:
Representing a sparse histogram, or more generally a sparse vector, is a fundamental task in differential privacy. An ideal solution would use space close to information-theoretical lower bounds, have an error distribution that depends optimally on the desired privacy level, and allow fast random access to entries in the vector. However, existing approaches have only achieved two of these three go…
▽ More
Representing a sparse histogram, or more generally a sparse vector, is a fundamental task in differential privacy. An ideal solution would use space close to information-theoretical lower bounds, have an error distribution that depends optimally on the desired privacy level, and allow fast random access to entries in the vector. However, existing approaches have only achieved two of these three goals.
In this paper we introduce the Approximate Laplace Projection (ALP) mechanism for approximating k-sparse vectors. This mechanism is shown to simultaneously have information-theoretically optimal space (up to constant factors), fast access to vector entries, and error of the same magnitude as the Laplace-mechanism applied to dense vectors. A key new technique is a unary representation of small integers, which we show to be robust against ``randomized response'' noise. This representation is combined with hashing, in the spirit of Bloom filters, to obtain a space-efficient, differentially private representation.
Our theoretical performance bounds are complemented by simulations which show that the constant factors on the main performance parameters are quite small, suggesting practicality of the technique.
△ Less
Submitted 27 September, 2021; v1 submitted 18 June, 2021;
originally announced June 2021.
-
Sampling a Near Neighbor in High Dimensions -- Who is the Fairest of Them All?
Authors:
Martin Aumüller,
Sariel Har-Peled,
Sepideh Mahabadi,
Rasmus Pagh,
Francesco Silvestri
Abstract:
Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points $S$ and a radius parameter $r>0$, the $r$-near neighbor ($r$-NN) problem asks for a data structure that, given any query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of individual fairness a…
▽ More
Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points $S$ and a radius parameter $r>0$, the $r$-near neighbor ($r$-NN) problem asks for a data structure that, given any query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance $r$ from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. In this work, we show that LSH based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights the inherent unfairness of NN data structures and shows the performance of our algorithms on real-world datasets.
△ Less
Submitted 26 January, 2021;
originally announced January 2021.
-
Differentially Private Sketches for Jaccard Similarity Estimation
Authors:
Martin Aumüller,
Anders Bourgeat,
Jana Schmurr
Abstract:
This paper describes two locally-differential private algorithms for releasing user vectors such that the Jaccard similarity between these vectors can be efficiently estimated. The basic building block is the well known MinHash method. To achieve a privacy-utility trade-off, MinHash is extended in two ways using variants of Generalized Randomized Response and the Laplace Mechanism. A theoretical a…
▽ More
This paper describes two locally-differential private algorithms for releasing user vectors such that the Jaccard similarity between these vectors can be efficiently estimated. The basic building block is the well known MinHash method. To achieve a privacy-utility trade-off, MinHash is extended in two ways using variants of Generalized Randomized Response and the Laplace Mechanism. A theoretical analysis provides bounds on the absolute error and experiments show the utility-privacy trade-off on synthetic and real-world data. The paper ends with a critical discussion of related work.
△ Less
Submitted 18 August, 2020;
originally announced August 2020.
-
The Role of Local Intrinsic Dimensionality in Benchmarking Nearest Neighbor Search
Authors:
Martin Aumüller,
Matteo Ceccarello
Abstract:
This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) allows to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of different LID distributions on the running time performance of implementations is empirically studied. To this end, different visualization co…
▽ More
This paper reconsiders common benchmarking approaches to nearest neighbor search. It is shown that the concept of local intrinsic dimensionality (LID) allows to choose query sets of a wide range of difficulty for real-world datasets. Moreover, the effect of different LID distributions on the running time performance of implementations is empirically studied. To this end, different visualization concepts are introduced that allow to get a more fine-grained overview of the inner workings of nearest neighbor search principles. The paper closes with remarks about the diversity of datasets commonly used for nearest neighbor search benchmarking. It is shown that such real-world datasets are not diverse: results on a single dataset predict results on all other datasets well.
△ Less
Submitted 17 July, 2019;
originally announced July 2019.
-
PUFFINN: Parameterless and Universally Fast FInding of Nearest Neighbors
Authors:
Martin Aumüller,
Tobias Christiani,
Rasmus Pagh,
Michael Vesterli
Abstract:
We present PUFFINN, a parameterless LSH-based index for solving the $k$-nearest neighbor problem with probabilistic guarantees. By parameterless we mean that the user is only required to specify the amount of memory the index is supposed to use and the result quality that should be achieved. The index combines several heuristic ideas known in the literature. By small adaptions to the query algorit…
▽ More
We present PUFFINN, a parameterless LSH-based index for solving the $k$-nearest neighbor problem with probabilistic guarantees. By parameterless we mean that the user is only required to specify the amount of memory the index is supposed to use and the result quality that should be achieved. The index combines several heuristic ideas known in the literature. By small adaptions to the query algorithm, we make heuristics rigorous. We perform experiments on real-world and synthetic inputs to evaluate implementation choices and show that the implementation satisfies the quality guarantees while being competitive with other state-of-the-art approaches to nearest neighbor search.
We describe a novel synthetic data set that is difficult to solve for almost all existing nearest neighbor search approaches, and for which PUFFINN significantly outperform previous methods.
△ Less
Submitted 28 June, 2019;
originally announced June 2019.
-
Fair Near Neighbor Search: Independent Range Sampling in High Dimensions
Authors:
Martin Aumüller,
Rasmus Pagh,
Francesco Silvestri
Abstract:
Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the $r$-near neighbor ($r$-NN) problem: given a radius $r>0$ and a set of points $S$, construct a data structure that, for any given query point $q$, returns a point $p$ within distance at most $r$ f…
▽ More
Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the $r$-near neighbor ($r$-NN) problem: given a radius $r>0$ and a set of points $S$, construct a data structure that, for any given query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance $r$ from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for $r$-NN where all points in $S$ that are near $q$ have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem.
△ Less
Submitted 15 June, 2020; v1 submitted 5 June, 2019;
originally announced June 2019.
-
Simple and Fast BlockQuicksort using Lomuto's Partitioning Scheme
Authors:
Martin Aumüller,
Nikolaj Hass
Abstract:
This paper presents simple variants of the BlockQuicksort algorithm described by Edelkamp and Weiss (ESA 2016). The simplification is achieved by using Lomuto's partitioning scheme instead of Hoare's crossing pointer technique to partition the input. To achieve a robust sorting algorithm that works well on many different input types, the paper introduces a novel two-pivot variant of Lomuto's parti…
▽ More
This paper presents simple variants of the BlockQuicksort algorithm described by Edelkamp and Weiss (ESA 2016). The simplification is achieved by using Lomuto's partitioning scheme instead of Hoare's crossing pointer technique to partition the input. To achieve a robust sorting algorithm that works well on many different input types, the paper introduces a novel two-pivot variant of Lomuto's partitioning scheme. A surprisingly simple twist to the generic two-pivot quicksort approach makes the algorithm robust. The paper provides an analysis of the theoretical properties of the proposed algorithms and compares them to their competitors. The analysis shows that Lomuto-based approaches incur a higher average sorting cost than the Hoare-based approach of BlockQuicksort. Moreover, the analysis is particularly useful to reason about pivot choices that suit the two-pivot approach. An extensive experimental study shows that, despite their worse theoretical behavior, the simpler variants perform as well as the original version of BlockQuicksort.
△ Less
Submitted 29 October, 2018;
originally announced October 2018.
-
ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms
Authors:
Martin Aumüller,
Erik Bernhardsson,
Alexander Faithfull
Abstract:
This paper describes ANN-Benchmarks, a tool for evaluating the performance of in-memory approximate nearest neighbor algorithms. It provides a standard interface for measuring the performance and quality achieved by nearest neighbor algorithms on different standard data sets. It supports several different ways of integrating $k$-NN algorithms, and its configuration system automatically tests a ran…
▽ More
This paper describes ANN-Benchmarks, a tool for evaluating the performance of in-memory approximate nearest neighbor algorithms. It provides a standard interface for measuring the performance and quality achieved by nearest neighbor algorithms on different standard data sets. It supports several different ways of integrating $k$-NN algorithms, and its configuration system automatically tests a range of parameter settings for each algorithm. Algorithms are compared with respect to many different (approximate) quality measures, and adding more is easy and fast; the included plotting front-ends can visualise these as images, $\LaTeX$ plots, and websites with interactive plots. ANN-Benchmarks aims to provide a constantly updated overview of the current state of the art of $k$-NN algorithms. In the short term, this overview allows users to choose the correct $k$-NN algorithm and parameters for their similarity search task; in the longer term, algorithm designers will be able to use this overview to test and refine automatic parameter tuning. The paper gives an overview of the system, evaluates the results of the benchmark, and points out directions for future work. Interestingly, very different approaches to $k$-NN search yield comparable quality-performance trade-offs. The system is available at http://ann-benchmarks.com .
△ Less
Submitted 17 July, 2018; v1 submitted 15 July, 2018;
originally announced July 2018.
-
Distance-Sensitive hashing
Authors:
Martin Aumüller,
Tobias Christiani,
Rasmus Pagh,
Francesco Silvestri
Abstract:
Locality-sensitive hashing (LSH) is an important tool for managing high-dimensional noisy or uncertain data, for example in connection with data cleaning (similarity join) and noise-robust search (similarity search). However, for a number of problems the LSH framework is not known to yield good solutions, and instead ad hoc solutions have been designed for particular similarity and distance measur…
▽ More
Locality-sensitive hashing (LSH) is an important tool for managing high-dimensional noisy or uncertain data, for example in connection with data cleaning (similarity join) and noise-robust search (similarity search). However, for a number of problems the LSH framework is not known to yield good solutions, and instead ad hoc solutions have been designed for particular similarity and distance measures. For example, this is true for output-sensitive similarity search/join, and for indexes supporting annulus queries that aim to report a point close to a certain given distance from the query point.
In this paper we initiate the study of distance-sensitive hashing (DSH), a generalization of LSH that seeks a family of hash functions such that the probability of two points having the same hash value is a given function of the distance between them. More precisely, given a distance space $(X, \text{dist})$ and a "collision probability function" (CPF) $f\colon \mathbb{R}\rightarrow [0,1]$ we seek a distribution over pairs of functions $(h,g)$ such that for every pair of points $x, y \in X$ the collision probability is $\Pr[h(x)=g(y)] = f(\text{dist}(x,y))$. Locality-sensitive hashing is the study of how fast a CPF can decrease as the distance grows. For many spaces, $f$ can be made exponentially decreasing even if we restrict attention to the symmetric case where $g=h$. We show that the asymmetry achieved by having a pair of functions makes it possible to achieve CPFs that are, for example, increasing or unimodal, and show how this leads to principled solutions to problems not addressed by the LSH framework. This includes a novel application to privacy-preserving distance estimation. We believe that the DSH framework will find further applications in high-dimensional data management.
△ Less
Submitted 17 April, 2018; v1 submitted 22 March, 2017;
originally announced March 2017.
-
Dual-Pivot Quicksort: Optimality, Analysis and Zeros of Associated Lattice Paths
Authors:
Martin Aumüller,
Martin Dietzfelbinger,
Clemens Heuberger,
Daniel Krenn,
Helmut Prodinger
Abstract:
We present an average case analysis of a variant of dual-pivot quicksort. We show that the used algorithmic partitioning strategy is optimal, i.e., it minimizes the expected number of key comparisons. For the analysis, we calculate the expected number of comparisons exactly as well as asymptotically, in particular, we provide exact expressions for the linear, logarithmic, and constant terms.
An…
▽ More
We present an average case analysis of a variant of dual-pivot quicksort. We show that the used algorithmic partitioning strategy is optimal, i.e., it minimizes the expected number of key comparisons. For the analysis, we calculate the expected number of comparisons exactly as well as asymptotically, in particular, we provide exact expressions for the linear, logarithmic, and constant terms.
An essential step is the analysis of zeros of lattice paths in a certain probability model. Along the way a combinatorial identity is proven.
△ Less
Submitted 27 November, 2017; v1 submitted 1 November, 2016;
originally announced November 2016.
-
A Simple Hash Class with Strong Randomness Properties in Graphs and Hypergraphs
Authors:
Martin Aumüller,
Martin Dietzfelbinger,
Philipp Woelfel
Abstract:
We study randomness properties of graphs and hypergraphs generated by simple hash functions. Several hashing applications can be analyzed by studying the structure of $d$-uniform random ($d$-partite) hypergraphs obtained from a set $S$ of $n$ keys and $d$ randomly chosen hash functions $h_1,\dots,h_d$ by associating each key $x\in S$ with a hyperedge $\{h_1(x),\dots, h_d(x)\}$. Often it is assumed…
▽ More
We study randomness properties of graphs and hypergraphs generated by simple hash functions. Several hashing applications can be analyzed by studying the structure of $d$-uniform random ($d$-partite) hypergraphs obtained from a set $S$ of $n$ keys and $d$ randomly chosen hash functions $h_1,\dots,h_d$ by associating each key $x\in S$ with a hyperedge $\{h_1(x),\dots, h_d(x)\}$. Often it is assumed that $h_1,\dots,h_d$ exhibit a high degree of independence. We present a simple construction of a hash class whose hash functions have small constant evaluation time and can be stored in sublinear space. We devise general techniques to analyze the randomness properties of the graphs and hypergraphs generated by these hash functions, and we show that they can replace other, less efficient constructions in cuckoo hashing (with and without stash), the simulation of a uniform hash function, the construction of a perfect hash function, generalized cuckoo hashing and different load balancing scenarios.
△ Less
Submitted 31 October, 2016;
originally announced November 2016.
-
Parameter-free Locality Sensitive Hashing for Spherical Range Reporting
Authors:
Thomas D. Ahle,
Martin Aumüller,
Rasmus Pagh
Abstract:
We present a data structure for *spherical range reporting* on a point set $S$, i.e., reporting all points in $S$ that lie within radius $r$ of a given query point $q$. Our solution builds upon the Locality-Sensitive Hashing (LSH) framework of Indyk and Motwani, which represents the asymptotically best solutions to near neighbor problems in high dimensions. While traditional LSH data structures ha…
▽ More
We present a data structure for *spherical range reporting* on a point set $S$, i.e., reporting all points in $S$ that lie within radius $r$ of a given query point $q$. Our solution builds upon the Locality-Sensitive Hashing (LSH) framework of Indyk and Motwani, which represents the asymptotically best solutions to near neighbor problems in high dimensions. While traditional LSH data structures have several parameters whose optimal values depend on the distance distribution from $q$ to the points of $S$, our data structure is parameter-free, except for the space usage, which is configurable by the user. Nevertheless, its expected query time basically matches that of an LSH data structure whose parameters have been *optimally chosen for the data and query* in question under the given space constraints. In particular, our data structure provides a smooth trade-off between hard queries (typically addressed by standard LSH) and easy queries such as those where the number of points to report is a constant fraction of $S$, or where almost all points in $S$ are far away from the query point. In contrast, known data structures fix LSH parameters based on certain parameters of the input alone.
The algorithm has expected query time bounded by $O(t (n/t)^ρ)$, where $t$ is the number of points to report and $ρ\in (0,1)$ depends on the data distribution and the strength of the LSH family used. We further present a parameter-free way of using multi-probing, for LSH families that support it, and show that for many such families this approach allows us to get expected query time close to $O(n^ρ+t)$, which is the best we can hope to achieve using LSH. The previously best running time in high dimensions was $Ω(t n^ρ)$. For many data distributions where the intrinsic dimensionality of the point set close to $q$ is low, we can give improved upper bounds on the expected query time.
△ Less
Submitted 20 July, 2016; v1 submitted 9 May, 2016;
originally announced May 2016.
-
Impact of water on the charge transport of a glass-forming ionic liquid
Authors:
P. Sippel,
V. Dietrich,
D. Reuter,
M. Aumüller,
P. Lunkenheimer,
A. Loidl,
S. Krohns
Abstract:
Using dielectric spectroscopy and differential scanning calorimetry, we have performed a detailed investigation of the influence of water uptake on the translational and reorientational glassy dynamics in the typical ionic liquid 1-Butyl-3-methyl-imidazolium chloride. From a careful analysis of the measured dielectric permittivity and conductivity spectra, we find a significant acceleration of cat…
▽ More
Using dielectric spectroscopy and differential scanning calorimetry, we have performed a detailed investigation of the influence of water uptake on the translational and reorientational glassy dynamics in the typical ionic liquid 1-Butyl-3-methyl-imidazolium chloride. From a careful analysis of the measured dielectric permittivity and conductivity spectra, we find a significant acceleration of cation reorientation and a marked increase of the ionic conductivity for higher water contents. The latter effect mainly arises due to a strong impact of water content on the glass temperature, which for the well-dried material is found to be larger than any values reported in literature for this system. The fragility, characterizing the non-Arrhenius glassy dynamics of the ionic subsystem, also changes with varying water content. Decoupling of the ionic motion from the structural dynamics has to be considered to explain the results.
△ Less
Submitted 7 April, 2016;
originally announced April 2016.
-
Counting Zeros in Random Walks on the Integers and Analysis of Optimal Dual-Pivot Quicksort
Authors:
Martin Aumüller,
Martin Dietzfelbinger,
Clemens Heuberger,
Daniel Krenn,
Helmut Prodinger
Abstract:
We present an average case analysis of two variants of dual-pivot quicksort, one with a non-algorithmic comparison-optimal partitioning strategy, the other with a closely related algorithmic strategy. For both we calculate the expected number of comparisons exactly as well as asymptotically, in particular, we provide exact expressions for the linear, logarithmic, and constant terms. An essential s…
▽ More
We present an average case analysis of two variants of dual-pivot quicksort, one with a non-algorithmic comparison-optimal partitioning strategy, the other with a closely related algorithmic strategy. For both we calculate the expected number of comparisons exactly as well as asymptotically, in particular, we provide exact expressions for the linear, logarithmic, and constant terms. An essential step is the analysis of zeros of lattice paths in a certain probability model. Along the way a combinatorial identity is proven.
△ Less
Submitted 11 May, 2016; v1 submitted 12 February, 2016;
originally announced February 2016.
-
How Good is Multi-Pivot Quicksort?
Authors:
Martin Aumüller,
Martin Dietzfelbinger,
Pascal Klaue
Abstract:
Multi-Pivot Quicksort refers to variants of classical quicksort where in the partitioning step $k$ pivots are used to split the input into $k + 1$ segments. For many years, multi-pivot quicksort was regarded as impractical, but in 2009 a 2-pivot approach by Yaroslavskiy, Bentley, and Bloch was chosen as the standard sorting algorithm in Sun's Java 7. In 2014 at ALENEX, Kushagra et al. introduced a…
▽ More
Multi-Pivot Quicksort refers to variants of classical quicksort where in the partitioning step $k$ pivots are used to split the input into $k + 1$ segments. For many years, multi-pivot quicksort was regarded as impractical, but in 2009 a 2-pivot approach by Yaroslavskiy, Bentley, and Bloch was chosen as the standard sorting algorithm in Sun's Java 7. In 2014 at ALENEX, Kushagra et al. introduced an even faster algorithm that uses three pivots. This paper studies what possible advantages multi-pivot quicksort might offer in general. The contributions are as follows: Natural comparison-optimal algorithms for multi-pivot quicksort are devised and analyzed. The analysis shows that the benefits of using multiple pivots with respect to the average comparison count are marginal and these strategies are inferior to simpler strategies such as the well known median-of-$k$ approach. A substantial part of the partitioning cost is caused by rearranging elements. A rigorous analysis of an algorithm for rearranging elements in the partitioning step is carried out, observing mainly how often array cells are accessed during partitioning. The algorithm behaves best if 3 to 5 pivots are used. Experiments show that this translates into good cache behavior and is closest to predicting observed running times of multi-pivot quicksort algorithms. Finally, it is studied how choosing pivots from a sample affects sorting cost. The study is theoretical in the sense that although the findings motivate design recommendations for multipivot quicksort algorithms that lead to running time improvements over known algorithms in an experimental setting, these improvements are small.
△ Less
Submitted 31 May, 2016; v1 submitted 15 October, 2015;
originally announced October 2015.
-
Optimal Partitioning for Dual-Pivot Quicksort
Authors:
Martin Aumüller,
Martin Dietzfelbinger
Abstract:
Dual-pivot quicksort refers to variants of classical quicksort where in the partitioning step two pivots are used to split the input into three segments. This can be done in different ways, giving rise to different algorithms. Recently, a dual-pivot algorithm proposed by Yaroslavskiy received much attention, because a variant of it replaced the well-engineered quicksort algorithm in Sun's Java 7 r…
▽ More
Dual-pivot quicksort refers to variants of classical quicksort where in the partitioning step two pivots are used to split the input into three segments. This can be done in different ways, giving rise to different algorithms. Recently, a dual-pivot algorithm proposed by Yaroslavskiy received much attention, because a variant of it replaced the well-engineered quicksort algorithm in Sun's Java 7 runtime library. Nebel and Wild (ESA 2012) analyzed this algorithm and showed that on average it uses 1.9n ln n + O(n) comparisons to sort an input of size n, beating standard quicksort, which uses 2n ln n + O(n) comparisons. We introduce a model that captures all dual-pivot algorithms, give a unified analysis, and identify new dual-pivot algorithms that minimize the average number of key comparisons among all possible algorithms up to a linear term. This minimum is 1.8n ln n + O(n). For the case that the pivots are chosen from a small sample, we include a comparison of dual-pivot quicksort and classical quicksort. Specifically, we show that dual-pivot quicksort benefits from a skewed choice of pivots. We experimentally evaluate our algorithms and compare them to Yaroslavskiy's algorithm and the recently described three-pivot quicksort algorithm of Kushagra et al. (ALENEX 2014).
△ Less
Submitted 13 October, 2015; v1 submitted 21 March, 2013;
originally announced March 2013.
-
Explicit and Efficient Hash Families Suffice for Cuckoo Hashing with a Stash
Authors:
Martin Aumüller,
Martin Dietzfelbinger,
Philipp Woelfel
Abstract:
It is shown that for cuckoo hashing with a stash as proposed by Kirsch, Mitzenmacher, and Wieder (2008) families of very simple hash functions can be used, maintaining the favorable performance guarantees: with stash size $s$ the probability of a rehash is $O(1/n^{s+1})$, and the evaluation time is $O(s)$. Instead of the full randomness needed for the analysis of Kirsch et al. and of Kutzelnigg (2…
▽ More
It is shown that for cuckoo hashing with a stash as proposed by Kirsch, Mitzenmacher, and Wieder (2008) families of very simple hash functions can be used, maintaining the favorable performance guarantees: with stash size $s$ the probability of a rehash is $O(1/n^{s+1})$, and the evaluation time is $O(s)$. Instead of the full randomness needed for the analysis of Kirsch et al. and of Kutzelnigg (2010) (resp. $Θ(\log n)$-wise independence for standard cuckoo hashing) the new approach even works with 2-wise independent hash families as building blocks. Both construction and analysis build upon the work of Dietzfelbinger and Woelfel (2003). The analysis, which can also be applied to the fully random case, utilizes a graph counting argument and is much simpler than previous proofs. As a byproduct, an algorithm for simulating uniform hashing is obtained. While it requires about twice as much space as the most space efficient solutions, it is attractive because of its simple and direct structure.
△ Less
Submitted 19 April, 2012;
originally announced April 2012.