DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers

Ma, Xueguang; Lin, Xi Victoria; Oguz, Barlas; Lin, Jimmy; Yih, Wen-tau; Chen, Xilun

arXiv:2502.18460 (cs)

[Submitted on 25 Feb 2025 (v1), last revised 3 Jun 2025 (this version, v2)]

Abstract: Large language models (LLMs) have demonstrated strong effectiveness and robustness when fine-tuned as dense retrievers. However, their large parameter size brings significant inference-time computational challenges, including high encoding costs for large-scale corpora and increased query latency, limiting their practical deployment. While smaller retrievers offer better efficiency, they often fail to generalize effectively with limited supervised fine-tuning data. In this work, we introduce DRAMA, a training framework that leverages LLMs to train smaller generalizable dense retrievers. In particular, we adopt pruned LLMs as the backbone and train on diverse LLM-augmented data in a single-stage contrastive learning setup. Experiments show that DRAMA offers better multilingual and long-context capabilities than traditional encoder-based retrievers, and achieves strong performance across multiple tasks and languages. These results highlight the potential of connecting the training of smaller retrievers with the growing advancements in LLMs, bridging the gap between efficiency and generalization.
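
The abstract mentions a contrastive learning setup for dense retrievers without giving the objective. As a rough illustration only, the sketch below shows a generic InfoNCE loss with in-batch negatives, a common choice for training dense retrievers. It is not taken from the paper: the function name `info_nce_loss`, the cosine-similarity scoring, and the temperature of 0.05 are all assumptions made for this example, and the authors' actual loss, negatives, and hyperparameters may differ.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(query_embs, passage_embs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (illustrative, hypothetical helper).

    passage_embs[i] is treated as the positive passage for query_embs[i];
    every other passage in the batch acts as a negative for that query.
    The temperature value is an assumption, not a setting from the paper.
    """
    queries = [normalize(q) for q in query_embs]
    passages = [normalize(p) for p in passage_embs]
    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity of this query to every passage, scaled by temperature.
        logits = [dot(q, p) / temperature for p in passages]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive passage.
        total += log_denom - logits[i]
    return total / len(queries)


# Toy 2-d embeddings: in the aligned batch each query matches its own passage.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In this toy batch the loss is near zero when each query embedding points at its own passage and large when the positives are swapped, which is the behavior a contrastive objective is meant to reward.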

Comments:	ACL 2025
Subjects:	Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:	arXiv:2502.18460 [cs.CL]
	(or arXiv:2502.18460v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.18460

Submission history

From: Xueguang Ma
[v1] Tue, 25 Feb 2025 18:59:07 UTC (976 KB)
[v2] Tue, 3 Jun 2025 17:47:36 UTC (365 KB)
