Unsupervised Multi-Index Semantic Hashing

Hansen, Christian; Hansen, Casper; Simonsen, Jakob Grue; Alstrup, Stephen; Lioma, Christina

doi:10.1145/3442381.3450014

arXiv:2103.14460 (cs)

[Submitted on 26 Mar 2021]

Abstract: Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan over all the hash codes, even though much faster alternatives exist. One such alternative is multi-index hashing, an approach that constructs a smaller candidate set to search over, which, depending on the distribution of the hash codes, can lead to sub-linear search time. In this work, we propose Multi-Index Semantic Hashing (MISH), an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing. We derive novel training objectives that make it possible to learn hash codes that reduce the candidate sets produced by multi-index hashing, while remaining end-to-end trainable. In fact, our proposed training objectives are model agnostic, i.e., not tied to how the hash codes are generated specifically in MISH, and are straightforward to include in existing and future semantic hashing models. We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search. We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH, while MISH still obtains state-of-the-art effectiveness.
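To make the candidate-set idea in the abstract concrete, the following is a minimal sketch of generic multi-index hashing search, not the MISH model or its training objectives. It relies on the pigeonhole argument: if two codes differ in at most r bits and each code is split into m disjoint substrings, at least one pair of substrings differs in at most floor(r/m) bits. All function names (`split_code`, `build_index`, `neighbors`, `search`) and the representation of hash codes as Python integers are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def split_code(code, num_bits, m):
    """Split an integer hash code into m disjoint substrings.

    Assumes num_bits is divisible by m (an illustrative simplification).
    """
    width = num_bits // m
    mask = (1 << width) - 1
    return [(code >> (i * width)) & mask for i in range(m)]


def build_index(codes, num_bits, m):
    """Build one hash table per substring position: substring -> doc ids."""
    tables = [defaultdict(list) for _ in range(m)]
    for doc_id, code in enumerate(codes):
        for i, sub in enumerate(split_code(code, num_bits, m)):
            tables[i][sub].append(doc_id)
    return tables


def neighbors(sub, width, radius):
    """Yield every width-bit value within Hamming distance radius of sub."""
    for d in range(radius + 1):
        for positions in combinations(range(width), d):
            flipped = sub
            for p in positions:
                flipped ^= 1 << p
            yield flipped


def search(query, codes, tables, num_bits, m, r):
    """Return (doc ids within Hamming radius r, candidate set size)."""
    width = num_bits // m
    sub_radius = r // m  # pigeonhole bound on the per-substring radius
    candidates = set()
    for i, sub in enumerate(split_code(query, num_bits, m)):
        for probe in neighbors(sub, width, sub_radius):
            candidates.update(tables[i].get(probe, ()))
    # Only the candidates are checked against the full Hamming distance.
    hits = sorted(
        d for d in candidates if bin(codes[d] ^ query).count("1") <= r
    )
    return hits, len(candidates)
```

The second return value is the size of the candidate set, which is the quantity the abstract says MISH is trained to reduce; a linear scan would instead compare the query against every stored code.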

Comments:	Proceedings of the 2021 World Wide Web Conference, published under Creative Commons CC-BY 4.0 License
Subjects:	Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:	arXiv:2103.14460 [cs.IR]
	(or arXiv:2103.14460v1 [cs.IR] for this version)
	https://doi.org/10.48550/arXiv.2103.14460
Related DOI:	https://doi.org/10.1145/3442381.3450014

Submission history

From: Casper Hansen
[v1] Fri, 26 Mar 2021 13:33:48 UTC (10,010 KB)
