Showing 1–50 of 74 results for author: MacAvaney, S

  1. arXiv:2505.21058  [pdf, ps, other]

    cs.IR

    Disentangling Locality and Entropy in Ranking Distillation

    Authors: Andrew Parry, Debasis Ganguly, Sean MacAvaney

    Abstract: The training process of ranking models involves two key data selection decisions: a sampling strategy and a labeling strategy. Modern ranking systems, especially those for performing semantic search, typically use a "hard negative" sampling strategy to identify challenging items using heuristics and a distillation labeling strategy to transfer ranking "knowledge" from a more capable model. In p…

    Submitted 27 May, 2025; originally announced May 2025.

    Comments: 9 pages, 2 figures, 3 tables, 9 page appendix with 2 figures and 2 tables

  2. arXiv:2505.15070  [pdf, ps, other]

    cs.IR cs.CL

    An Alternative to FLOPS Regularization to Effectively Productionize SPLADE-Doc

    Authors: Aldo Porco, Dhruv Mehra, Igor Malioutov, Karthik Radhakrishnan, Moniba Keymanesh, Daniel Preoţiuc-Pietro, Sean MacAvaney, Pengxiang Cheng

    Abstract: Learned Sparse Retrieval (LSR) models encode text as weighted term vectors, which need to be sparse to leverage inverted index structures during retrieval. SPLADE, the most popular LSR model, uses FLOPS regularization to encourage vector sparsity during training. However, FLOPS regularization does not ensure sparsity among terms - only within a given query or document. Terms with very high Documen…

    Submitted 20 May, 2025; originally announced May 2025.

    Comments: Accepted as a short paper at SIGIR 2025

  3. arXiv:2505.08411  [pdf]

    cs.IR

    Lost in Transliteration: Bridging the Script Gap in Neural IR

    Authors: Andreas Chari, Iadh Ounis, Sean MacAvaney

    Abstract: Most human languages use scripts other than the Latin alphabet. Search users in these languages often formulate their information needs in a transliterated -- usually Latinized -- form for ease of typing. For example, Greek speakers might use Greeklish, and Arabic speakers might use Arabizi. This paper shows that current search systems, including those that use multilingual dense embeddings such a…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: 6 pages, 2 tables. Paper accepted at the Short Paper track of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval

  4. Artifact Sharing for Information Retrieval Research

    Authors: Sean MacAvaney

    Abstract: Sharing artifacts -- such as trained models, pre-built indexes, and the code to use them -- aids in reproducibility efforts by allowing researchers to validate intermediate steps and improves the sustainability of research by allowing multiple groups to build off one another's prior computational work. Although there are de facto consensuses on how to share research code (through a git repository…

    Submitted 8 May, 2025; originally announced May 2025.

    Comments: SIGIR 2025 (demo)

  5. arXiv:2504.11011  [pdf, other]

    cs.IR cs.AI

    Document Quality Scoring for Web Crawling

    Authors: Francesca Pezzuti, Ariane Mueller, Sean MacAvaney, Nicola Tonellotto

    Abstract: The internet contains large amounts of low-quality content, yet users expect web search engines to deliver high-quality, relevant results. The abundant presence of low-quality pages can negatively impact retrieval and crawling processes by wasting resources on these documents. Therefore, search engines can greatly benefit from techniques that leverage efficient quality estimation methods to mitiga…

    Submitted 15 April, 2025; originally announced April 2025.

    Comments: Presented at WOWS2025

  6. MURR: Model Updating with Regularized Replay for Searching a Document Stream

    Authors: Eugene Yang, Nicola Tonellotto, Dawn Lawrie, Sean MacAvaney, James Mayfield, Douglas W. Oard, Scott Miller

    Abstract: The Internet produces a continuous stream of new documents and user-generated queries. These naturally change over time based on events in the world and the evolution of language. Neural retrieval models that were trained once on a fixed set of query-document pairs will quickly start misrepresenting newly-created content and queries, leading to less effective retrieval. Traditional statistical spa…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: Published at ECIR 2025. 16 pages, 4 figures

  7. arXiv:2504.09984  [pdf, other]

    cs.IR

    On Precomputation and Caching in Information Retrieval Experiments with Pipeline Architectures

    Authors: Sean MacAvaney, Craig Macdonald

    Abstract: Modern information retrieval systems often rely on multiple components executed in a pipeline. In a research setting, this can lead to substantial redundant computations (e.g., retrieving the same query multiple times for evaluating different downstream rerankers). To overcome this, researchers take cached "result" files as inputs, which represent the output of another pipeline. However, these res…

    Submitted 14 April, 2025; originally announced April 2025.

    Comments: WOWS @ ECIR 2025

  8. arXiv:2504.09353  [pdf, other]

    cs.IR

    Breaking the Lens of the Telescope: Online Relevance Estimation over Large Retrieval Sets

    Authors: Mandeep Rathee, V Venktesh, Sean MacAvaney, Avishek Anand

    Abstract: Advanced relevance models, such as those that use large language models (LLMs), provide highly accurate relevance estimations. However, their computational costs make them infeasible for processing large document corpora. To address this, retrieval systems often employ a telescoping approach, where computationally efficient but less precise lexical and semantic retrievers filter potential candidat…

    Submitted 7 May, 2025; v1 submitted 12 April, 2025; originally announced April 2025.

    Comments: Accepted for publication at SIGIR'25. 11 pages, 5 figures, 4 tables

  9. GRIT: Graph-based Recall Improvement for Task-oriented E-commerce Queries

    Authors: Hrishikesh Kulkarni, Surya Kallumadi, Sean MacAvaney, Nazli Goharian, Ophir Frieder

    Abstract: Many e-commerce search pipelines have four stages, namely: retrieval, filtering, ranking, and personalized-reranking. The retrieval stage must be efficient and yield high recall because relevant products missed in the first stage cannot be considered in later stages. This is challenging for task-oriented queries (queries with actionable intent) where user requirements are contextually intensive an…

    Submitted 16 February, 2025; originally announced April 2025.

    Comments: LLM4ECommerce at WWW 2025

    Journal ref: Companion Proceedings of the ACM Web Conference 2025 (WWW Companion 25), April 28-May 2, 2025, Sydney, NSW, Australia. ACM, New York, NY, USA, 10 pages

  10. arXiv:2504.01818  [pdf, other]

    cs.IR cs.CL

    Efficient Constant-Space Multi-Vector Retrieval

    Authors: Sean MacAvaney, Antonio Mallia, Nicola Tonellotto

    Abstract: Multi-vector retrieval methods, exemplified by the ColBERT architecture, have shown substantial promise for retrieval by providing strong trade-offs in terms of retrieval latency and effectiveness. However, they come at a high cost in terms of storage since a (potentially compressed) vector needs to be stored for every token in the input collection. To overcome this issue, we propose encoding docu…

    Submitted 2 April, 2025; originally announced April 2025.

    Comments: ECIR 2025

  11. arXiv:2503.22672  [pdf, ps, other]

    cs.IR cs.AI

    Exploring the Effectiveness of Multi-stage Fine-tuning for Cross-encoder Re-rankers

    Authors: Francesca Pezzuti, Sean MacAvaney, Nicola Tonellotto

    Abstract: State-of-the-art cross-encoders can be fine-tuned to be highly effective in passage re-ranking. The typical fine-tuning process of cross-encoders as re-rankers requires large amounts of manually labelled data, a contrastive learning objective, and a set of heuristically sampled negatives. An alternative recent approach for fine-tuning instead involves teaching the model to mimic the rankings of a…

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: 7 pages. To be published as short paper in the Proceedings of the European Conference on Information Retrieval (ECIR) 2025

  12. arXiv:2503.22508  [pdf, other]

    cs.IR

    Improving Low-Resource Retrieval Effectiveness using Zero-Shot Linguistic Similarity Transfer

    Authors: Andreas Chari, Sean MacAvaney, Iadh Ounis

    Abstract: Globalisation and colonisation have led the vast majority of the world to use only a fraction of languages, such as English and French, to communicate, excluding many others. This has severely affected the survivability of many now-deemed vulnerable or endangered languages, such as Occitan and Sicilian. These languages often share some characteristics, such as elements of their grammar and lexicon…

    Submitted 28 March, 2025; originally announced March 2025.

    Comments: 12 Pages, 5 Figures, 2 Tables, Full Paper accepted at IR4GOOD track in ECIR 2025

  13. Variations in Relevance Judgments and the Shelf Life of Test Collections

    Authors: Andrew Parry, Maik Fröbe, Harrisen Scells, Ferdinand Schlatt, Guglielmo Faggioli, Saber Zerhoudi, Sean MacAvaney, Eugene Yang

    Abstract: The fundamental property of Cranfield-style evaluations, that system rankings are stable even when assessors disagree on individual relevance decisions, was validated on traditional test collections. However, the paradigm shift towards neural retrieval models affected the characteristics of modern test collections, e.g., documents are short, judged with four grades of relevance, and information ne…

    Submitted 21 May, 2025; v1 submitted 28 February, 2025; originally announced February 2025.

    Comments: 11 pages, 6 tables, 5 figures, Accepted to SIGIR 2025

  14. arXiv:2501.19264  [pdf, other]

    cs.IR cs.CL cs.LG

    mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval

    Authors: Orion Weller, Benjamin Chang, Eugene Yang, Mahsa Yarmohammadi, Sam Barham, Sean MacAvaney, Arman Cohan, Luca Soldaini, Benjamin Van Durme, Dawn Lawrie

    Abstract: Retrieval systems generally focus on web-style queries that are short and underspecified. However, advances in language models have facilitated the nascent rise of retrieval models that can understand more complex queries with diverse intents. However, these efforts have focused exclusively on English; therefore, we do not yet understand how they work across languages. We introduce mFollowIR, a mu…

    Submitted 31 January, 2025; originally announced January 2025.

    Comments: Accepted to ECIR 2025

  15. arXiv:2501.10165  [pdf, other]

    cs.IR

    MechIR: A Mechanistic Interpretability Framework for Information Retrieval

    Authors: Andrew Parry, Catherine Chen, Carsten Eickhoff, Sean MacAvaney

    Abstract: Mechanistic interpretability is an emerging diagnostic approach for neural models that has gained traction in broader natural language processing domains. This paradigm aims to provide attribution to components of neural systems where causal relationships between hidden layers and output were previously uninterpretable. As the use of neural models in IR for retrieval and evaluation becomes ubiquit…

    Submitted 17 January, 2025; originally announced January 2025.

    Comments: 5 pages, 2 figures, Accepted to ECIR 2025 as a Demo Paper

  16. arXiv:2501.09186  [pdf, other]

    cs.IR cs.AI

    Guiding Retrieval using LLM-based Listwise Rankers

    Authors: Mandeep Rathee, Sean MacAvaney, Avishek Anand

    Abstract: Large Language Models (LLMs) have shown strong promise as rerankers, especially in "listwise" settings where an LLM is prompted to rerank several search results at once. However, this "cascading" retrieve-and-rerank approach is limited by the bounded recall problem: relevant documents not retrieved initially are permanently excluded from the final ranking. Adaptive retrieval techniques address…

    Submitted 15 January, 2025; originally announced January 2025.

    Comments: 16 pages, 2 figures, 3 tables

  17. arXiv:2411.02284  [pdf, other]

    cs.IR

    Training on the Test Model: Contamination in Ranking Distillation

    Authors: Vishakha Suresh Kalal, Andrew Parry, Sean MacAvaney

    Abstract: Neural approaches to ranking based on pre-trained language models are highly effective in ad-hoc search. However, the computational expense of these models can limit their application. As such, a process known as knowledge distillation is frequently applied to allow a smaller, efficient model to learn from an effective but expensive model. A key example of this is the distillation of expensive API…

    Submitted 4 November, 2024; originally announced November 2024.

    Comments: 4 pages

  18. Quam: Adaptive Retrieval through Query Affinity Modelling

    Authors: Mandeep Rathee, Sean MacAvaney, Avishek Anand

    Abstract: Building relevance models to rank documents based on user information needs is a central task in information retrieval and the NLP community. Beyond the direct ad-hoc search setting, many knowledge-intense tasks are powered by a first-stage retrieval stage for context selection, followed by a more involved task-specific model. However, most first-stage ranking stages are inherently limited by the…

    Submitted 26 October, 2024; originally announced October 2024.

    Comments: 15 pages, 10 figures

  19. arXiv:2410.07722  [pdf, other]

    cs.IR

    DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities

    Authors: Thong Nguyen, Shubham Chatterjee, Sean MacAvaney, Iain Mackie, Jeff Dalton, Andrew Yates

    Abstract: Learned Sparse Retrieval (LSR) models use vocabularies from pre-trained transformers, which often split entities into nonsensical fragments. Splitting entities can reduce retrieval accuracy and limits the model's ability to incorporate up-to-date world knowledge not included in the training data. In this work, we enhance the LSR vocabulary with Wikipedia concepts and entities, enabling the model t…

    Submitted 15 October, 2024; v1 submitted 10 October, 2024; originally announced October 2024.

    Comments: https://github.com/thongnt99/DyVo

    Journal ref: EMNLP 2024

  20. LexBoost: Improving Lexical Document Retrieval with Nearest Neighbors

    Authors: Hrishikesh Kulkarni, Nazli Goharian, Ophir Frieder, Sean MacAvaney

    Abstract: Sparse retrieval methods like BM25 are based on lexical overlap, focusing on the surface form of the terms that appear in the query and the document. The use of inverted indices in these methods leads to high retrieval efficiency. On the other hand, dense retrieval methods are based on learned dense vectors and, consequently, are effective but comparatively slow. Since sparse and dense methods app…

    Submitted 25 August, 2024; originally announced September 2024.

    Comments: ACM DocEng 2024

    Journal ref: ACM Symposium on Document Engineering 2024 (DocEng '24), August 20-23, 2024, San Jose, CA, USA. ACM, New York, NY, USA

  21. arXiv:2409.00085  [pdf, other]

    cs.CL cs.IR

    Genetic Approach to Mitigate Hallucination in Generative IR

    Authors: Hrishikesh Kulkarni, Nazli Goharian, Ophir Frieder, Sean MacAvaney

    Abstract: Generative language models hallucinate. That is, at times, they generate factually flawed responses. These inaccuracies are particularly insidious because the responses are fluent and well-articulated. We focus on the task of Grounded Answer Generation (part of Generative IR), which aims to produce direct answers to a user's question based on results retrieved from a search engine. We address hall…

    Submitted 25 August, 2024; originally announced September 2024.

    Comments: Gen-IR@SIGIR 2024

    Journal ref: The Second Workshop on Generative Information Retrieval at ACM SIGIR 2024

  22. Neural Passage Quality Estimation for Static Pruning

    Authors: Xuejun Chang, Debabrata Mishra, Craig Macdonald, Sean MacAvaney

    Abstract: Neural networks -- especially those that use large, pre-trained language models -- have improved search engines in various ways. Most prominently, they can estimate the relevance of a passage or document to a user's query. In this work, we depart from this direction by exploring whether neural networks can effectively predict which of a document's passages are unlikely to be relevant to any query…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: SIGIR 2024

  23. arXiv:2405.14589  [pdf, other]

    cs.IR

    Top-Down Partitioning for Efficient List-Wise Ranking

    Authors: Andrew Parry, Sean MacAvaney, Debasis Ganguly

    Abstract: Large Language Models (LLMs) have significantly impacted many facets of natural language processing and information retrieval. Unlike previous encoder-based approaches, the enlarged context window of these generative models allows for ranking multiple documents at once, commonly called list-wise ranking. However, there are still limits to the number of documents that can be ranked in a single infe…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: 16 pages, 3 figures, 2 tables

  24. arXiv:2405.01122  [pdf, other]

    cs.IR

    Generative Relevance Feedback and Convergence of Adaptive Re-Ranking: University of Glasgow Terrier Team at TREC DL 2023

    Authors: Andrew Parry, Thomas Jaenich, Sean MacAvaney, Iadh Ounis

    Abstract: This paper describes our participation in the TREC 2023 Deep Learning Track. We submitted runs that apply generative relevance feedback from a large language model in both a zero-shot and pseudo-relevance feedback setting over two sparse retrieval approaches, namely BM25 and SPLADE. We couple this first stage with adaptive re-ranking over a BM25 corpus graph scored using a monoELECTRA cross-encode…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: 5 pages, 5 figures, TREC Deep Learning 2023 Notebook

  25. On the Evaluation of Machine-Generated Reports

    Authors: James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, Noah Hibbler

    Abstract: Large Language Models (LLMs) have enabled new ways to satisfy information needs. Although great strides have been made in applying them to settings like document ranking and short-form text generation, they still struggle to compose complete, accurate, and verifiable long-form reports. Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of…

    Submitted 9 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: 12 pages, 4 figures, accepted at SIGIR 2024 as perspective paper

  26. Exploiting Positional Bias for Query-Agnostic Generative Content in Search

    Authors: Andrew Parry, Sean MacAvaney, Debasis Ganguly

    Abstract: In recent years, neural ranking models (NRMs) have been shown to substantially outperform their lexical counterparts in text retrieval. In traditional search pipelines, a combination of features leads to well-defined behaviour. However, as neural approaches become increasingly prevalent as the final scoring component of engines or as standalone systems, their robustness to malicious text and, more…

    Submitted 9 October, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: 8 pages, 4 main figures, 7 appendix pages, 2 appendix figures, Accepted to ACL 2024 Findings

  27. A Reproducibility Study of PLAID

    Authors: Sean MacAvaney, Nicola Tonellotto

    Abstract: The PLAID (Performance-optimized Late Interaction Driver) algorithm for ColBERTv2 uses clustered term representations to retrieve and progressively prune documents for final (exact) document scoring. In this paper, we reproduce and fill in missing gaps from the original work. By studying the parameters PLAID introduces, we find that its Pareto frontier is formed of a careful balance among its thre…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: SIGIR 2024 (reproducibility track)

  28. arXiv:2404.08071  [pdf, other]

    cs.IR

    Overview of the TREC 2023 NeuCLIR Track

    Authors: Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang

    Abstract: The principal goal of the TREC Neural Cross-Language Information Retrieval (NeuCLIR) track is to study the impact of neural approaches to cross-language information retrieval. The track has created four collections, large collections of Chinese, Persian, and Russian newswire and a smaller collection of Chinese scientific abstracts. The principal tasks are ranked retrieval of news in one of the thr…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: 27 pages, 17 figures. Part of the TREC 2023 Proceedings

  29. Shallow Cross-Encoders for Low-Latency Retrieval

    Authors: Aleksandr V. Petrov, Sean MacAvaney, Craig Macdonald

    Abstract: Transformer-based Cross-Encoders achieve state-of-the-art effectiveness in text retrieval. However, Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window. However, keeping search latencies low is important for user satisfaction and energy usage. In this pape…

    Submitted 29 March, 2024; originally announced March 2024.

    Comments: Accepted by ECIR2024

  30. arXiv:2403.15246  [pdf, other]

    cs.IR cs.CL cs.LG

    FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

    Authors: Orion Weller, Benjamin Chang, Sean MacAvaney, Kyle Lo, Arman Cohan, Benjamin Van Durme, Dawn Lawrie, Luca Soldaini

    Abstract: Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, w…

    Submitted 7 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

  31. arXiv:2403.07654  [pdf, other]

    cs.IR

    Analyzing Adversarial Attacks on Sequence-to-Sequence Relevance Models

    Authors: Andrew Parry, Maik Fröbe, Sean MacAvaney, Martin Potthast, Matthias Hagen

    Abstract: Modern sequence-to-sequence relevance models like monoT5 can effectively capture complex textual interactions between queries and documents through cross-encoding. However, the use of natural language tokens in prompts, such as Query, Document, and Relevant for monoT5, opens an attack vector for malicious documents to manipulate their relevance score through prompt injection, e.g., by adding targe…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 13 pages, 3 figures, Accepted at ECIR 2024 as a Full Paper

  32. arXiv:2403.01981  [pdf, other]

    cs.IR

    Evaluating the Explainability of Neural Rankers

    Authors: Saran Pandian, Debasis Ganguly, Sean MacAvaney

    Abstract: Information retrieval models have witnessed a paradigm shift from unsupervised statistical approaches to feature-based supervised approaches to completely data-driven ones that make use of the pre-training of large language models. While the increasing complexity of the search models has been able to demonstrate improvements in effectiveness (measured in terms of relevance of top-retrieved result…

    Submitted 4 March, 2024; originally announced March 2024.

  33. arXiv:2401.11198  [pdf, other]

    cs.IR

    A Deep Learning Approach for Selective Relevance Feedback

    Authors: Suchana Datta, Debasis Ganguly, Sean MacAvaney, Derek Greene

    Abstract: Pseudo-relevance feedback (PRF) can enhance average retrieval effectiveness over a sufficiently large number of queries. However, PRF often introduces a drift into the original information need, thus hurting the retrieval effectiveness of several queries. While a selective application of PRF can potentially alleviate this issue, previous approaches have largely relied on unsupervised or feature-ba…

    Submitted 20 January, 2024; originally announced January 2024.

  34. On the Effects of Regional Spelling Conventions in Retrieval Models

    Authors: Andreas Chari, Sean MacAvaney, Iadh Ounis

    Abstract: One advantage of neural ranking models is that they are meant to generalise well in situations of synonymity i.e. where two words have similar or identical meanings. In this paper, we investigate and quantify how well various ranking models perform in a clear-cut case of synonymity: when words are simply expressed in different surface forms due to regional differences in spelling conventions (e.g.…

    Submitted 1 August, 2023; originally announced August 2023.

    Comments: 10 pages, 3 tables, short paper published in SIGIR '23

    Journal ref: SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2023, Pages 2220-2224

  35. arXiv:2308.00415  [pdf, other]

    cs.IR

    Generative Query Reformulation for Effective Adhoc Search

    Authors: Xiao Wang, Sean MacAvaney, Craig Macdonald, Iadh Ounis

    Abstract: Performing automatic reformulations of a user's query is a popular paradigm used in information retrieval (IR) for improving effectiveness -- as exemplified by the pseudo-relevance feedback approaches, which expand the query in order to alleviate the vocabulary mismatch problem. Recent advancements in generative language models have demonstrated their ability in generating responses that are relev…

    Submitted 1 August, 2023; originally announced August 2023.

    Comments: Accepted to Gen-IR@SIGIR2023 Workshop

  36. Lexically-Accelerated Dense Retrieval

    Authors: Hrishikesh Kulkarni, Sean MacAvaney, Nazli Goharian, Ophir Frieder

    Abstract: Retrieval approaches that score documents based on learned dense vectors (i.e., dense retrieval) rather than lexical signals (i.e., conventional retrieval) are increasingly popular. Their ability to identify related documents that do not necessarily contain the same terms as those appearing in the user's query (thereby improving recall) is one of their key advantages. However, to actually achieve…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: SIGIR 2023

  37. arXiv:2306.17082  [pdf, other]

    cs.IR

    Adaptive Latent Entity Expansion for Document Retrieval

    Authors: Iain Mackie, Shubham Chatterjee, Sean MacAvaney, Jeffrey Dalton

    Abstract: Despite considerable progress in neural relevance ranking techniques, search engines still struggle to process complex queries effectively - both in terms of precision and recall. Sparse and dense Pseudo-Relevance Feedback (PRF) approaches have the potential to overcome limitations in recall, but are only effective with high precision in the top ranks. In this work, we tackle the problem of search…

    Submitted 4 December, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

  38. arXiv:2306.09657  [pdf, other]

    cs.IR cs.CL

    Online Distillation for Pseudo-Relevance Feedback

    Authors: Sean MacAvaney, Xi Wang

    Abstract: Model distillation has emerged as a prominent technique to improve neural search models. To date, distillation has taken an offline approach, wherein a new neural model is trained to predict relevance scores between arbitrary queries and documents. In this paper, we explore a departure from this offline distillation strategy by investigating whether a model for a specific query can be effectively dist…

    Submitted 16 June, 2023; originally announced June 2023.

  39. The Information Retrieval Experiment Platform

    Authors: Maik Fröbe, Jan Heinrich Reimer, Sean MacAvaney, Niklas Deckers, Simon Reich, Janek Bevendorff, Benno Stein, Matthias Hagen, Martin Potthast

    Abstract: We integrate ir_datasets, ir_measures, and PyTerrier with TIRA in the Information Retrieval Experiment Platform (TIREx) to promote more standardized, reproducible, scalable, and even blinded retrieval experiments. Standardization is achieved when a retrieval approach implements PyTerrier's interfaces and the input and output of an experiment are compatible with ir_datasets and ir_measures. However…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: 11 pages. To be published in the proceedings of SIGIR 2023

  40. arXiv:2305.18494  [pdf, other]

    cs.IR cs.LG

    Adapting Learned Sparse Retrieval for Long Documents

    Authors: Thong Nguyen, Sean MacAvaney, Andrew Yates

    Abstract: Learned sparse retrieval (LSR) is a family of neural retrieval methods that transform queries and documents into sparse weight vectors aligned with a vocabulary. While LSR approaches like Splade work well for short passages, it is unclear how well they handle longer documents. We investigate existing aggregation approaches for adapting LSR to longer documents and find that proximal scoring is cruc…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: SIGIR 2023

    Journal ref: SIGIR 2023

  41. arXiv:2304.12367  [pdf, other]

    cs.IR

    Overview of the TREC 2022 NeuCLIR Track

    Authors: Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang

    Abstract: This is the first year of the TREC Neural CLIR (NeuCLIR) track, which aims to study the impact of neural approaches to cross-language information retrieval. The main task in this year's track was ad hoc ranked retrieval of Chinese, Persian, or Russian newswire documents using queries expressed in English. Topics were developed using standard TREC processes, except that topics developed by an annot…

    Submitted 24 September, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

    Comments: 22 pages, 13 figures, 10 tables. Part of the Thirty-First Text REtrieval Conference (TREC 2022) Proceedings. Replace the misplaced Russian result table

  42. arXiv:2303.13416  [pdf, other]

    cs.IR

    A Unified Framework for Learned Sparse Retrieval

    Authors: Thong Nguyen, Sean MacAvaney, Andrew Yates

    Abstract: Learned sparse retrieval (LSR) is a family of first-stage retrieval methods that are trained to generate sparse lexical representations of queries and documents for use with an inverted index. Many LSR methods have been recently introduced, with Splade models achieving state-of-the-art performance on MSMarco. Despite similarities in their model architectures, many LSR methods show substantial diff…

    Submitted 23 March, 2023; originally announced March 2023.

    Journal ref: ECIR 2023

  43. One-Shot Labeling for Automatic Relevance Estimation

    Authors: Sean MacAvaney, Luca Soldaini

    Abstract: Dealing with unjudged documents ("holes") in relevance assessments is a perennial problem when evaluating search systems with offline experiments. Holes can reduce the apparent effectiveness of retrieval systems during evaluation and introduce biases in models trained with incomplete data. In this work, we explore whether large language models can help us fill such holes to improve offline evaluat…

    Submitted 11 July, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: SIGIR 2023

  44. arXiv:2301.03266  [pdf, other]

    cs.IR

    Doc2Query--: When Less is More

    Authors: Mitko Gospodinov, Sean MacAvaney, Craig Macdonald

    Abstract: Doc2Query -- the process of expanding the content of a document before indexing using a sequence-to-sequence model -- has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to "hallucinating" content that is not present in the source text. We argue that Doc2Query is indeed prone to hal…

    Submitted 27 February, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

    Comments: ECIR 2023

  45. Adaptive Re-Ranking with a Corpus Graph

    Authors: Sean MacAvaney, Nicola Tonellotto, Craig Macdonald

    Abstract: Search systems often employ a re-ranking pipeline, wherein documents (or passages) from an initial pool of candidates are assigned new ranking scores. The process enables the use of highly-effective but expensive scoring functions that are not suitable for use directly in structures like inverted indices or approximate nearest neighbour indices. However, re-ranking pipelines are inherently limited…

    Submitted 18 August, 2022; originally announced August 2022.

    Comments: CIKM 2022

  46. arXiv:2205.04546  [pdf, other]

    cs.IR

    CODEC: Complex Document and Entity Collection

    Authors: Iain Mackie, Paul Owoicho, Carlos Gemmell, Sophie Fischer, Sean MacAvaney, Jeffrey Dalton

    Abstract: CODEC is a document and entity ranking benchmark that focuses on complex research topics. We target essay-style information needs of social science researchers, e.g., "How has the UK's Open Banking Regulation benefited Challenger Banks?". CODEC includes 42 topics developed by researchers and a new focused web corpus with semantic annotations including entity links. This resource includes expert jud…

    Submitted 17 May, 2022; v1 submitted 9 May, 2022; originally announced May 2022.

    Comments: 10 pages, SIGIR 2022 Preprint

    ACM Class: H.3.3

  47. On Survivorship Bias in MS MARCO

    Authors: Prashansa Gupta, Sean MacAvaney

    Abstract: Survivorship bias is the tendency to concentrate on the positive outcomes of a selection process and overlook the results that generate negative outcomes. We observe that this bias could be present in the popular MS MARCO dataset, given that annotators could not find answers to 38--45% of the queries, leading to these queries being discarded in training and evaluation processes. Although we find t…

    Submitted 27 April, 2022; originally announced April 2022.

    Comments: SIGIR 2022

  48. arXiv:2201.08622  [pdf, other]

    cs.IR

    Reproducing Personalised Session Search over the AOL Query Log

    Authors: Sean MacAvaney, Craig Macdonald, Iadh Ounis

    Abstract: Despite its troubled past, the AOL Query Log continues to be an important resource to the research community -- particularly for tasks like search personalisation. When using the query log for these ranking experiments, little attention is usually paid to the document corpus. Recent work typically uses a corpus containing versions of the documents collected long after the log was produced. Given that…

    Submitted 21 January, 2022; originally announced January 2022.

    Comments: ECIR 2022 (reproducibility)

  49. arXiv:2111.13466  [pdf, ps, other]

    cs.IR

    Streamlining Evaluation with ir-measures

    Authors: Sean MacAvaney, Craig Macdonald, Iadh Ounis

    Abstract: We present ir-measures, a new tool that makes it convenient to calculate a diverse set of evaluation measures used in information retrieval. Rather than implementing its own measure calculations, ir-measures provides a common interface to a handful of evaluation tools. The necessary tools are automatically invoked (potentially multiple times) to calculate all the desired metrics, simplifying the e…

    Submitted 26 November, 2021; originally announced November 2021.

    Comments: ECIR 2022 (demo)

  50. arXiv:2108.13810  [pdf, other]

    cs.LG cs.AI stat.ML

    Max-Utility Based Arm Selection Strategy For Sequential Query Recommendations

    Authors: Shameem A. Puthiya Parambath, Christos Anagnostopoulos, Roderick Murray-Smith, Sean MacAvaney, Evangelos Zervas

    Abstract: We consider the query recommendation problem in closed loop interactive learning settings like online information gathering and exploratory analytics. The problem can be naturally modelled using the Multi-Armed Bandits (MAB) framework with countably many arms. The standard MAB algorithms for countably many arms begin with selecting a random set of candidate arms and then applying standard MAB algo…

    Submitted 31 August, 2021; originally announced August 2021.

    Report number: 2021

    Journal ref: Asian Conference on Machine Learning 2021