-
Ultra-Quantisation: Efficient Embedding Search via 1.58-bit Encodings
Authors:
Richard Connor,
Alan Dearle,
Ben Claydon
Abstract:
Many modern search domains comprise high-dimensional vectors of floating point numbers derived from neural networks, in the form of embeddings. Typical embeddings range in size from hundreds to thousands of dimensions, making the size of the embeddings, and the speed of comparison, a significant issue.
Quantisation is a class of mechanism which replaces the floating point values with a smaller representation, for example a short integer. This gives an approximation of the embedding space in return for a smaller data representation and a faster comparison function.
Here we take this idea almost to its extreme: we show how vectors of arbitrary-precision floating point values can be replaced by vectors whose elements are drawn from the set {-1,0,1}. This yields very significant savings in space and metric evaluation cost, while maintaining a strong correlation for similarity measurements.
This is achieved by way of a class of convex polytopes which exist in the high-dimensional space. In this article we give an outline description of these objects, and show how they can be used as the basis of such radical quantisation while maintaining a surprising degree of accuracy.
Submitted 31 May, 2025;
originally announced June 2025.
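As a rough illustration of the idea in the abstract above, the following sketch replaces each floating point component with an element of {-1, 0, 1} and compares the results with an integer dot product. The fixed `threshold` rule and the function names are assumptions made for illustration; the abstract describes a construction based on convex polytopes, which this sketch does not reproduce. The "1.58-bit" figure in the title corresponds to log2(3), the information content of one three-valued element.

```python
import math

def ternarise(vec, threshold):
    """Map each component to -1, 0 or 1 (assumed rule: |x| <= threshold becomes 0)."""
    return [1 if x > threshold else -1 if x < -threshold else 0 for x in vec]

def ternary_dot(a, b):
    """Dot product of two ternary vectors; only integer arithmetic is needed."""
    return sum(x * y for x, y in zip(a, b))

# Information content of one element drawn from a three-valued set.
BITS_PER_ELEMENT = math.log2(3)

u = ternarise([0.9, -0.05, -0.7, 0.2], threshold=0.3)
v = ternarise([0.8, 0.4, -0.6, -0.1], threshold=0.3)
similarity = ternary_dot(u, v)
```

How well such a similarity correlates with the original metric depends on the encoding; the simple threshold used here carries no accuracy guarantee.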
-
nSimplex Zen: A Novel Dimensionality Reduction for Euclidean and Hilbert Spaces
Authors:
Richard Connor,
Lucia Vadicamo
Abstract:
Dimensionality reduction techniques map values from a high dimensional space to one with a lower dimension. The result is a space which requires less physical memory and has a faster distance calculation. These techniques are widely used where required properties of the reduced-dimension space give an acceptable accuracy with respect to the original space. Many such transforms have been described. They have been classified in two main groups: linear and topological. Linear methods such as Principal Component Analysis (PCA) and Random Projection (RP) define matrix-based transforms into a lower dimension of Euclidean space. Topological methods such as Multidimensional Scaling (MDS) attempt to preserve higher-level aspects such as the nearest-neighbour relation, and some may be applied to non-Euclidean spaces. Here, we introduce nSimplex Zen, a novel topological method of reducing dimensionality. Like MDS, it relies only upon pairwise distances measured in the original space. The use of distances, rather than coordinates, allows the technique to be applied to both Euclidean and other Hilbert spaces, including those governed by Cosine, Jensen-Shannon and Quadratic Form distances. We show that in almost all cases, due to geometric properties of high-dimensional spaces, our new technique gives better properties than others, especially with reduction to very low dimensions.
Submitted 13 February, 2024; v1 submitted 22 February, 2023;
originally announced February 2023.
-
A Ptolemaic Partitioning Mechanism
Authors:
Richard Connor
Abstract:
For many years, exact metric search relied upon the property of triangle inequality to give a lower bound on uncalculated distances. Two exclusion mechanisms derive from this property, generally known as pivot exclusion and hyperplane exclusion. These mechanisms work in any proper metric space and are the basis of many metric indexing mechanisms. More recently, the Ptolemaic and four-point lower bound properties have been shown to give tighter bounds in some subclasses of metric space.
Both triangle inequality and the four-point lower bound directly imply straightforward partitioning mechanisms: that is, a method of dividing a finite space according to a fixed partition, in order that one or more classes of the partition can be eliminated from a search at query time. However, up to now, no partitioning principle has been identified for the Ptolemaic inequality, which has been used only as a filtering mechanism.
Here, a novel partitioning mechanism for the Ptolemaic lower bound is presented. It is always better than either pivot or hyperplane partitioning. While the exclusion condition itself is weaker than Hilbert (four-point) exclusion, its calculation is cheaper. Furthermore, it can be combined with Hilbert exclusion to give a new maximum for exclusion power with respect to the number of distances measured per query.
Submitted 19 August, 2022;
originally announced August 2022.
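The two kinds of lower bound mentioned in the abstract above can be compared in a short sketch. It shows only the filtering bounds: the triangle-inequality (pivot) bound and the usual two-pivot Ptolemaic bound. The partitioning mechanism introduced in the paper is not reproduced here, and the function names are illustrative assumptions.

```python
import math

def triangle_lower_bound(d_qp, d_sp):
    """Pivot bound from triangle inequality: d(q,s) >= |d(q,p) - d(s,p)|."""
    return abs(d_qp - d_sp)

def ptolemaic_lower_bound(d_qp1, d_qp2, d_sp1, d_sp2, d_p1p2):
    """Two-pivot bound from Ptolemy's inequality:
    d(q,s) >= |d(q,p1)*d(s,p2) - d(q,p2)*d(s,p1)| / d(p1,p2)."""
    return abs(d_qp1 * d_sp2 - d_qp2 * d_sp1) / d_p1p2

# Euclidean example: two pivots, a query q and a stored object s.
p1, p2 = (0.0, 0.0), (1.0, 0.0)
q, s = (0.0, 1.0), (1.0, 1.0)

true_distance = math.dist(q, s)
tri = triangle_lower_bound(math.dist(q, p1), math.dist(s, p1))
pto = ptolemaic_lower_bound(
    math.dist(q, p1), math.dist(q, p2),
    math.dist(s, p1), math.dist(s, p2),
    math.dist(p1, p2),
)
```

In this configuration the Ptolemaic bound is tight while the triangle bound is not, although neither bound dominates the other in general.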
-
High-Dimensional Simplexes for Supermetric Search
Authors:
Richard Connor,
Lucia Vadicamo,
Fausto Rabitti
Abstract:
In 1953, Blumenthal showed that every semi-metric space that is isometrically embeddable in a Hilbert space has the n-point property; we have previously called such spaces supermetric spaces. Although this is a strictly stronger property than triangle inequality, it is nonetheless closely related and many useful metric spaces possess it. These include Euclidean, Cosine and Jensen-Shannon spaces of any dimension. A simple corollary of the n-point property is that, for any (n+1) objects sampled from the space, there exists an n-dimensional simplex in Euclidean space whose edge lengths correspond to the distances among the objects. We show how the construction of such simplexes in higher dimensions can be used to give arbitrarily tight lower and upper bounds on distances within the original space. This allows the construction of an n-dimensional Euclidean space, from which lower and upper bounds of the original space can be calculated, and which is itself an indexable space with the n-point property. For similarity search, the engineering tradeoffs are good: we show significant reductions in data size and metric cost with little loss of accuracy, leading to a significant overall improvement in search performance.
Submitted 26 July, 2017;
originally announced July 2017.
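A minimal sketch of the lowest-dimensional case of the idea in the abstract above: with two reference points, each object is mapped to a point in the 2D plane using only its distances to the references, and lower and upper bounds on the original distance follow from that plane. The function names and the restriction to two reference points are assumptions for illustration; the paper's general n-dimensional construction is not reproduced.

```python
import math

def project(d1, d2, delta):
    """Place an object in 2D from its distances d1, d2 to two reference
    points that are delta apart; y is the altitude above the base line."""
    x = (d1 * d1 - d2 * d2 + delta * delta) / (2 * delta)
    y = math.sqrt(max(0.0, d1 * d1 - x * x))
    return x, y

def bounds(a, b):
    """Lower bound: planar distance. Upper bound: same, with one altitude mirrored."""
    (xa, ya), (xb, yb) = a, b
    lower = math.hypot(xa - xb, ya - yb)
    upper = math.hypot(xa - xb, ya + yb)
    return lower, upper

# Objects in 3D Euclidean space, reduced using two reference points.
r1, r2 = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
a, b = (1.0, 1.0, 0.0), (1.0, 0.0, 1.0)
delta = math.dist(r1, r2)

pa = project(math.dist(a, r1), math.dist(a, r2), delta)
pb = project(math.dist(b, r1), math.dist(b, r2), delta)
lower, upper = bounds(pa, pb)
true_distance = math.dist(a, b)
```

With only two reference points the bounds are loose; the abstract states that they tighten as more reference points, and hence higher-dimensional simplexes, are used.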
-
Supermetric Search
Authors:
Richard Connor,
Lucia Vadicamo,
Franco Alberto Cardillo,
Fausto Rabitti
Abstract:
Metric search is concerned with the efficient evaluation of queries in metric spaces. In general, a large space of objects is arranged in such a way that, when a further object is presented as a query, those objects most similar to the query can be efficiently found. Most mechanisms rely upon the triangle inequality property of the metric governing the space. The triangle inequality property is equivalent to a finite embedding property, which states that any three points of the space can be isometrically embedded in two-dimensional Euclidean space. In this paper, we examine a class of semimetric space which is finitely four-embeddable in three-dimensional Euclidean space. In mathematics this property has been extensively studied and is generally known as the four-point property. All spaces with the four-point property are metric spaces, but they also have some stronger geometric guarantees. We coin the term supermetric space as, in terms of metric search, they are significantly more tractable. Supermetric spaces include all those governed by Euclidean, Cosine, Jensen-Shannon and Triangular distances, and are thus commonly used within many domains. In previous work we have given a generic mathematical basis for the supermetric property and shown how it can improve indexing performance for a given exact search structure. Here we present a full investigation into its use within a variety of different hyperplane partition indexing structures, and go on to show some more of its flexibility by examining a search structure whose partition and exclusion conditions are tailored, at each node, to suit the individual reference points and data set present there. Among the results given, we show a new best performance for exact search using a well-known benchmark.
Submitted 22 October, 2017; v1 submitted 26 July, 2017;
originally announced July 2017.
-
Hilbert Exclusion: Improved Metric Search through Finite Isometric Embeddings
Authors:
Richard Connor,
Franco Alberto Cardillo,
Lucia Vadicamo,
Fausto Rabitti
Abstract:
Most research into similarity search in metric spaces relies upon the triangle inequality property. This property allows the space to be arranged according to relative distances to avoid searching some subspaces. We show that many common metric spaces, notably including those using Euclidean and Jensen-Shannon distances, also have a stronger property, sometimes called the four-point property: in essence, these spaces allow an isometric embedding of any four points in three-dimensional Euclidean space, as well as any three points in two-dimensional Euclidean space. In fact, we show that any space which is isometrically embeddable in Hilbert space has the stronger property. This property gives stronger geometric guarantees, and one in particular, which we name the Hilbert Exclusion property, allows any indexing mechanism which uses hyperplane partitioning to perform better. One outcome of this observation is that a number of state-of-the-art indexing mechanisms over high dimensional spaces can be easily extended to give a significant increase in performance; furthermore, the improvement given is greater in higher dimensions. This therefore leads to a significant improvement in the cost of metric search in these spaces.
Submitted 28 April, 2016;
originally announced April 2016.
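The difference between ordinary hyperplane exclusion and the stronger test described in the abstract above can be sketched as follows. The form of the Hilbert Exclusion condition used here, (d(q,p1)² − d(q,p2)²) / (2·d(p1,p2)) > t, is an assumption not stated in the abstract, and the function names are illustrative.

```python
import math

def hyperplane_margin(d_qp1, d_qp2):
    """Triangle-inequality margin: objects nearer p1 than p2 cannot lie
    within t of q when (d(q,p1) - d(q,p2)) / 2 > t."""
    return (d_qp1 - d_qp2) / 2

def hilbert_margin(d_qp1, d_qp2, d_p1p2):
    """Assumed four-point margin: the distance from q to the bisecting
    hyperplane of p1 and p2 in a planar embedding."""
    return (d_qp1 ** 2 - d_qp2 ** 2) / (2 * d_p1p2)

# Euclidean example: q lies on the p2 side of the hyperplane x = 2.
p1, p2, q = (0.0, 0.0), (4.0, 0.0), (3.0, 2.0)
t = 0.8  # query radius

d1, d2, d12 = math.dist(q, p1), math.dist(q, p2), math.dist(p1, p2)
exclude_by_hyperplane = hyperplane_margin(d1, d2) > t
exclude_by_hilbert = hilbert_margin(d1, d2, d12) > t
```

For this query the partition nearer p1 is excluded by the four-point test but not by the triangle-inequality test, which illustrates why the stronger property can let a hyperplane-partitioned index prune more.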
-
Genomics and Biological Big Data: Facing Current and Future Challenges around Data and Software Sharing and Reproducibility
Authors:
Sandra Gesing,
Thomas Richard Connor,
Ian Taylor
Abstract:
Novel technologies in genomics allow the creation of data at exascale with relatively minor human, laboratory and thus monetary effort compared to the capabilities of only a decade ago. While the availability of this data makes it possible to answer research questions that would not have been feasible to answer, or perhaps even to ask, before, the amount of data creates new challenges, which need new software and data management systems. Such new solutions have to consider integrative approaches, which address not only the effectiveness and efficiency of data processing but also improve reusability, reproducibility and usability, especially tailored to the target user communities of genomic big data. In our opinion, current solutions tackle part of the challenges and each have their strengths, but fail to provide a complete solution. We present in this paper the key challenges and the characteristics that cutting-edge developments should possess to fulfil the needs of the user communities and allow for seamless sharing and data analysis on a large scale.
Submitted 9 November, 2015;
originally announced November 2015.
-
First Smart Spaces
Authors:
Graham Kirby,
Alan Dearle,
Andrew McCarthy,
Ron Morrison,
Kevin Mullen,
Yanyan Yang,
Richard Connor,
Paula Welen,
Andy Wilson
Abstract:
This document describes the Gloss software currently implemented. The description of the Gloss demonstrator for multi-surface interaction can be found in D17. The ongoing integration activity for the work described in D17 and D8 constitutes our development of infrastructure for a first smart space. In this report, the focus is on infrastructure to support the implementation of location aware services. A local architecture provides a framework for constructing Gloss applications, termed assemblies, that run on individual physical nodes. A global architecture defines an overlay network for linking individual assemblies. Both local and global architectures are under active development.
Submitted 1 July, 2010;
originally announced July 2010.
-
Architectural Support for Global Smart Spaces
Authors:
Alan Dearle,
Graham Kirby,
Ron Morrison,
Andrew McCarthy,
Kevin Mullen,
Yanyan Yang,
Richard Connor,
Paula Welen,
Andy Wilson
Abstract:
A GLObal Smart Space (GLOSS) provides support for interaction amongst people, artefacts and places while taking account of both context and movement on a global scale. Crucial to the definition of a GLOSS is the provision of a set of location-aware services that detect, convey, store and exploit location information. We use one of these services, hearsay, to illustrate the implementation dimensions of a GLOSS. The focus of the paper is on both local and global software architecture to support the implementation of such services. The local architecture is based on XML pipelines and is used to construct location-aware components. The global architecture is based on a hybrid peer-to-peer routing scheme and provides the local architectures with the means to communicate in the global context.
Submitted 24 June, 2010;
originally announced June 2010.
-
Active Architecture for Pervasive Contextual Services
Authors:
Graham Kirby,
Alan Dearle,
Ron Morrison,
Mark Dunlop,
Richard Connor,
Paddy Nixon
Abstract:
Pervasive services may be defined as services that are available "to any client (anytime, anywhere)". Here we focus on the software and network infrastructure required to support pervasive contextual services operating over a wide area. One of the key requirements is a matching service capable of assimilating and filtering information from various sources and determining matches relevant to those services. We consider some of the challenges in engineering a globally distributed matching service that is scalable, manageable, and able to evolve incrementally as usage patterns, data formats, services, network topologies and deployment technologies change. We outline an approach based on the use of a peer-to-peer architecture to distribute user events and data, and to support the deployment and evolution of the infrastructure itself.
Submitted 24 June, 2010;
originally announced June 2010.