Search | arXiv e-print repository

Towards Reliable Testing for Multiple Information Retrieval System Comparisons

Authors: David Otero, Javier Parapar, Álvaro Barreiro

Abstract: Null Hypothesis Significance Testing is the de facto tool for assessing effectiveness differences between Information Retrieval systems. Researchers use statistical tests to check whether those differences will generalise to online settings or are just due to the samples observed in the laboratory. Much work has been devoted to studying which test is the most reliable when comparing a pair of systems, but most real-world IR experiments involve more than two. In the multiple comparisons scenario, testing several systems simultaneously may inflate the errors committed by the tests. In this paper, we use a new approach to assess the reliability of multiple comparison procedures using simulated and real TREC data. Experiments show that Wilcoxon plus the Benjamini-Hochberg correction yields Type I error rates in line with the significance level for typical sample sizes while being the best test in terms of statistical power.

Submitted 7 January, 2025; originally announced January 2025.

arXiv:2411.13212 [pdf, other]

Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation

Authors: David Otero, Javier Parapar, Álvaro Barreiro

Abstract: Offline evaluation of search systems depends on test collections. These benchmarks provide researchers with a corpus of documents, topics and relevance judgements indicating which documents are relevant for each topic. While test collections are an integral part of Information Retrieval (IR) research, their creation involves significant manual annotation effort. Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment. Recent research has shown that LLM-based assessments yield high system ranking correlation with human-made judgements. These correlations are helpful in large-scale experiments but less informative if we want to focus on top-performing systems. Moreover, these correlations ignore whether and how LLM-based judgements impact the statistically significant differences among systems with respect to human assessments. In this work, we look at how well LLM-generated judgements preserve ranking differences among top-performing systems and how well they preserve the pairwise significance decisions obtained with human judgements. Our results show that LLM-based judgements are unfair at ranking top-performing systems. Moreover, we observe an exceedingly high rate of false positives regarding statistical differences. Our work represents a step forward in the evaluation of the reliability of using LLM-based judgements for IR evaluation. We hope this will serve as a basis for other researchers to develop more reliable models for automatic relevance assessment.

Submitted 15 April, 2025; v1 submitted 20 November, 2024; originally announced November 2024.

arXiv:2410.04289 [pdf, other]

Self-Supervised Anomaly Detection in the Wild: Favor Joint Embeddings Methods

Authors: Daniel Otero, Rafael Mateus, Randall Balestriero

Abstract: Accurate anomaly detection is critical in vision-based infrastructure inspection, where it helps prevent costly failures and enhances safety. Self-Supervised Learning (SSL) offers a promising approach by learning robust representations from unlabeled data. However, its application in anomaly detection remains underexplored. This paper addresses this gap by providing a comprehensive evaluation of SSL methods for real-world anomaly detection, focusing on sewer infrastructure. Using the Sewer-ML dataset, we evaluate lightweight models such as ViT-Tiny and ResNet-18 across SSL frameworks, including BYOL, Barlow Twins, SimCLR, DINO, and MAE, under varying class imbalance levels. Through 250 experiments, we rigorously assess the performance of these SSL methods to ensure a robust and comprehensive evaluation. Our findings highlight the superiority of joint-embedding methods like SimCLR and Barlow Twins over reconstruction-based approaches such as MAE, which struggle to maintain performance under class imbalance. Furthermore, we find that the SSL model choice is more critical than the backbone architecture. Additionally, we emphasize the need for better label-free assessments of SSL representations, as current methods like RankMe fail to adequately evaluate representation quality, making cross-validation without labels infeasible. Despite the remaining performance gap between SSL and supervised models, these findings highlight the potential of SSL to enhance anomaly detection, paving the way for further research in this underexplored area of SSL applications.

Submitted 5 October, 2024; originally announced October 2024.

arXiv:2409.02140 [pdf, ps, other]

Self-Supervised Learning for Identifying Defects in Sewer Footage

Authors: Daniel Otero, Rafael Mateus

Abstract: Sewerage infrastructure is among the most expensive modern investments, requiring time-intensive manual inspections by qualified personnel. Our study addresses the need for automated solutions without relying on large amounts of labeled data. We propose a novel application of Self-Supervised Learning (SSL) for sewer inspection that offers a scalable and cost-effective solution for defect detection. We achieve competitive results with a model that is at least 5 times smaller than other approaches found in the literature and obtain competitive performance with 10% of the available data when training with a larger architecture. Our findings highlight the potential of SSL to revolutionize sewer maintenance in resource-limited settings.

Submitted 2 September, 2024; originally announced September 2024.

Comments: Poster at the LatinX in AI Workshop @ ICML 2024

arXiv:2308.09340 [pdf, other]

doi 10.1145/3583780.3614916

How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods

Authors: David Otero, Javier Parapar, Nicola Ferro

Abstract: Creating test collections for offline retrieval evaluation requires human effort to judge documents' relevance. This expensive activity has motivated much work on methods for constructing benchmarks at lower assessment cost. In this respect, adjudication methods actively decide both which documents experts review and the order in which they review them, in order to better exploit the assessment budget or to lower it. Researchers evaluate the quality of those methods by measuring the correlation between the known gold ranking of systems under the full collection and the observed ranking of systems under the lower-cost one. This traditional analysis ignores whether and how the low-cost judgements affect the statistically significant differences among systems with respect to the full collection. We fill this void by proposing a novel methodology to evaluate how well low-cost adjudication methods preserve the pairwise significant differences between systems found with the full collection. In other words, while traditional approaches look for stability in answering the question "is system A better than system B?", our proposed approach looks for stability in answering the question "is system A significantly better than system B?", which is the ultimate question researchers need to answer to guarantee the generalisability of their results. Among other results, we found that the best methods in terms of system ranking correlation do not always match those that best preserve statistical significance.

Submitted 28 August, 2023; v1 submitted 18 August, 2023; originally announced August 2023.

Journal ref: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023)

arXiv:2002.02657 [pdf, ps, other]

Optimization of Structural Similarity in Mathematical Imaging

Authors: D. Otero, D. La Torre, O. Michailovich, E. R. Vrscay

Abstract: It is now generally accepted that Euclidean-based metrics may not always adequately represent the subjective judgement of a human observer. As a result, many image processing methodologies have been recently extended to take advantage of alternative visual quality measures, the most prominent of which is the Structural Similarity Index Measure (SSIM). The superiority of the latter over Euclidean-based metrics has been demonstrated in several studies. However, being focused on specific applications, the findings of such studies often lack generality which, if otherwise acknowledged, could have provided useful guidance for further development of SSIM-based image processing algorithms. Accordingly, instead of focusing on a particular image processing task, in this paper, we introduce a general framework that encompasses a wide range of imaging applications in which the SSIM can be employed as a fidelity measure. Subsequently, we show how the framework can be used to cast some standard as well as original imaging tasks into optimization problems, followed by a discussion of a number of novel numerical strategies for their solution.

Submitted 7 February, 2020; originally announced February 2020.

arXiv:1908.02798 [pdf, other]

doi 10.1109/JIOT.2019.2959552

Extreme coverage in 5G Narrowband IoT: a LUT-based strategy to optimize shared channels

Authors: Emmanuel Luján, Juan A. Zuloaga Mellino, Alejandro D. Otero, Leonardo Rey Vega, Cecilia G. Galarza, Esteban E. Mocskos

Abstract: One of the main challenges in IoT is providing communication support to an increasing number of connected devices. In recent years, narrowband radio technology has emerged to address this situation: Narrowband Internet of Things (NB-IoT), which is now part of 5G. Supporting massive connectivity becomes particularly demanding in extreme coverage scenarios such as underground or deep inside buildings. We propose a novel strategy for these situations focused on optimizing NB-IoT shared channels through the selection of link parameters: modulation and coding scheme, as well as the number of repetitions. These parameters are established by the base station (BS) for each block transmitted until reaching a target block error rate (BLER_t). A wrong selection of these parameters leads to radio resource waste and a decrease in the number of possible concurrent connections. Specifically, our strategy is based on a look-up table (LUT) scheme which is used for rapidly delivering the optimal link parameters given a target QoS. To validate our proposal, we compare with alternative strategies using an open source NB-IoT uplink simulator. The experiments are based on transmitting blocks of 256 bits using an AWGN channel over the NPUSCH. Results show that, especially under extreme conditions, only a few options for link parameters are available, favoring robustness against measurement uncertainties. Our strategy minimizes resource usage in all scenarios of acknowledged mode and remarkably reduces losses in the unacknowledged mode, presenting also substantial gains in performance. We expect to influence future BS software design and implementation, favoring connection support under extreme environments.

Submitted 24 December, 2019; v1 submitted 7 August, 2019; originally announced August 2019.

Comments: Paper accepted at IEEE IoT Journal

Showing 1–7 of 7 results for author: Otero, D