-
Manifesto from Dagstuhl Perspectives Workshop 24352 -- Conversational Agents: A Framework for Evaluation (CAFE)
Authors:
Christine Bauer,
Li Chen,
Nicola Ferro,
Norbert Fuhr,
Avishek Anand,
Timo Breuer,
Guglielmo Faggioli,
Ophir Frieder,
Hideo Joho,
Jussi Karlgren,
Johannes Kiesel,
Bart P. Knijnenburg,
Aldo Lipani,
Lien Michiels,
Andrea Papenmeier,
Maria Soledad Pera,
Mark Sanderson,
Scott Sanner,
Benno Stein,
Johanne R. Trippas,
Karin Verspoor,
Martijn C Willemsen
Abstract:
During the workshop, we discussed in depth what CONversational Information ACcess (CONIAC) is and what its unique features are, proposed a world model abstracting it, and defined the Conversational Agents Framework for Evaluation (CAFE) for the evaluation of CONIAC systems, consisting of six major components: 1) goals of the system's stakeholders, 2) user tasks to be studied in the evaluation, 3) aspects of the users carrying out the tasks, 4) evaluation criteria to be considered, 5) evaluation methodology to be applied, and 6) measures for the quantitative criteria chosen.
Submitted 8 June, 2025;
originally announced June 2025.
-
Reproducibility and Artifact Consistency of the SIGIR 2022 Recommender Systems Papers Based on Message Passing
Authors:
Maurizio Ferrari Dacrema,
Michael Benigni,
Nicola Ferro
Abstract:
Graph-based techniques relying on neural networks and embeddings have gained attention as a way to develop Recommender Systems (RS) with several papers on the topic presented at SIGIR 2022 and 2023. Given the importance of ensuring that published research is methodologically sound and reproducible, in this paper we analyze 10 graph-based RS papers, most of which were published at SIGIR 2022, and assess their impact on subsequent work published in SIGIR 2023. Our analysis reveals several critical points that require attention: (i) the prevalence of bad practices, such as erroneous data splits or information leakage between training and testing data, which call into question the validity of the results; (ii) frequent inconsistencies between the provided artifacts (source code and data) and their descriptions in the paper, causing uncertainty about what is actually being evaluated; and (iii) the preference for new or complex baselines that are weaker compared to simpler ones, creating the impression of continuous improvement even when, particularly for the Amazon-Book dataset, the state-of-the-art has significantly worsened. Due to these issues, we are unable to confirm the claims made in most of the papers we examined and attempted to reproduce.
Submitted 10 March, 2025;
originally announced March 2025.
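One of the bad practices named in the abstract above, information leakage between training and testing data, can be checked mechanically. The following is a minimal sketch, not the authors' procedure: it assumes interactions are stored as (user, item) pairs, and the function name `find_leakage` and the toy data are hypothetical.

```python
def find_leakage(train, test):
    """Return the (user, item) interactions that appear in both splits."""
    return set(train) & set(test)


# Toy interaction data (hypothetical): each tuple is (user_id, item_id).
train_split = [(1, "a"), (1, "b"), (2, "a"), (3, "c")]
test_split = [(1, "b"), (2, "d"), (3, "e")]

# Any interaction listed here was seen during training and then tested on.
leaked = find_leakage(train_split, test_split)
print(sorted(leaked))
```

A non-empty result means the test split contains interactions the model has already seen, which is the kind of issue that calls reported results into question.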
-
Words Blending Boxes. Obfuscating Queries in Information Retrieval using Differential Privacy
Authors:
Francesco Luigi De Faveri,
Guglielmo Faggioli,
Nicola Ferro
Abstract:
Ensuring the effectiveness of search queries while protecting user privacy remains an open issue. When an Information Retrieval System (IRS) does not protect the privacy of its users, sensitive information may be disclosed through the queries sent to the system. Recent improvements, especially in NLP, have shown the potential of using Differential Privacy to obfuscate texts while maintaining satisfactory effectiveness. However, such approaches may protect the user's privacy only from a theoretical perspective while, in practice, the real user's information need can still be inferred if perturbed terms are too semantically similar to the original ones. We overcome such limitations by proposing Word Blending Boxes, a novel differentially private mechanism for query obfuscation, which protects the words in the user queries by employing safe boxes. To measure the overall effectiveness of the proposed WBB mechanism, we measure the privacy obtained by the obfuscation process, i.e., the lexical and semantic similarity between original and obfuscated queries. Moreover, we assess the effectiveness of the privatized queries in retrieving relevant documents from the IRS. Our findings indicate that WBB can be integrated effectively into existing IRSs, offering a key to the challenge of protecting user privacy from both a theoretical and a practical point of view.
Submitted 15 May, 2024;
originally announced May 2024.
-
Level set-fitted polytopal meshes with application to structural topology optimization
Authors:
Nicola Ferro,
Stefano Micheletti,
Nicola Parolini,
Simona Perotto,
Marco Verani,
Paola Francesca Antonietti
Abstract:
We propose a method to modify a polygonal mesh in order to fit the zero-isoline of a level set function by extending a standard body-fitted strategy to a tessellation with arbitrarily-shaped elements. The novel level set-fitted approach, in combination with a Discontinuous Galerkin finite element approximation, provides an ideal setting to model physical problems characterized by embedded or evolving complex geometries, since it allows skipping any mesh post-processing in terms of grid quality. The proposed methodology is first assessed on the linear elasticity equation, by verifying the approximation capability of the level set-fitted approach when dealing with configurations with heterogeneous material properties. Subsequently, we combine the level set-fitted methodology with a minimum compliance topology optimization technique, in order to deliver optimized layouts exhibiting crisp boundaries and reliable mechanical performance. An extensive numerical test campaign confirms the effectiveness of the proposed method.
Submitted 2 November, 2023; v1 submitted 20 September, 2023;
originally announced September 2023.
-
How Discriminative Are Your Qrels? How To Study the Statistical Significance of Document Adjudication Methods
Authors:
David Otero,
Javier Parapar,
Nicola Ferro
Abstract:
Creating test collections for offline retrieval evaluation requires human effort to judge documents' relevance. This expensive activity motivated much work in developing methods for constructing benchmarks with fewer assessment costs. In this respect, adjudication methods actively decide both which documents and the order in which experts review them, in order to better exploit the assessment budget or to lower it. Researchers evaluate the quality of those methods by measuring the correlation between the known gold ranking of systems under the full collection and the observed ranking of systems under the lower-cost one. This traditional analysis ignores whether and how the low-cost judgements affect the statistically significant differences among systems with respect to the full collection. We fill this void by proposing a novel methodology to evaluate how well the low-cost adjudication methods preserve the pairwise significant differences between systems observed under the full collection. In other terms, while traditional approaches look for stability in answering the question "is system A better than system B?", our proposed approach looks for stability in answering the question "is system A significantly better than system B?", which is the ultimate question researchers need to answer to guarantee the generalisability of their results. Among other results, we found that the best methods in terms of ranking of systems correlation do not always match those preserving statistical significance.
Submitted 28 August, 2023; v1 submitted 18 August, 2023;
originally announced August 2023.
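The idea in the abstract above, checking whether pairwise significant differences found with the full judgements survive under low-cost judgements, can be sketched as follows. This is an illustration under stated assumptions, not the paper's methodology: the choice of an exact paired sign-flip test, the 0.05 threshold, all function names, and the toy per-topic scores are hypothetical.

```python
from itertools import combinations, product


def sign_flip_p_value(a, b):
    """Exact two-sided paired sign-flip test on per-topic scores."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = total = 0
    # Enumerate every assignment of signs to the per-topic differences.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            hits += 1
    return hits / total


def significant_pairs(scores, alpha=0.05):
    """Return the (better, worse) system pairs that differ significantly."""
    pairs = set()
    for s1, s2 in combinations(sorted(scores), 2):
        if sign_flip_p_value(scores[s1], scores[s2]) < alpha:
            mean1 = sum(scores[s1]) / len(scores[s1])
            mean2 = sum(scores[s2]) / len(scores[s2])
            pairs.add((s1, s2) if mean1 > mean2 else (s2, s1))
    return pairs


def compare(full, low_cost):
    """Count significant pairs preserved, lost, or newly introduced."""
    return {
        "preserved": len(full & low_cost),
        "lost": len(full - low_cost),
        "new": len(low_cost - full),
    }


# Hypothetical per-topic scores of two systems under each set of judgements.
full_scores = {"A": [0.8, 0.7, 0.9, 0.6, 0.75, 0.85],
               "B": [0.5, 0.4, 0.6, 0.3, 0.45, 0.55]}
low_cost_scores = {"A": [0.8, 0.4, 0.9, 0.3, 0.75, 0.5],
                   "B": [0.5, 0.5, 0.6, 0.4, 0.45, 0.55]}

summary = compare(significant_pairs(full_scores),
                  significant_pairs(low_cost_scores))
print(summary)
```

In this toy case system A still has the higher mean under the low-cost judgements, so a ranking-based comparison would report agreement, while the significant difference itself is lost.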
-
Report from Dagstuhl Seminar 23031: Frontiers of Information Access Experimentation for Research and Education
Authors:
Christine Bauer,
Ben Carterette,
Nicola Ferro,
Norbert Fuhr
Abstract:
This report documents the program and the outcomes of Dagstuhl Seminar 23031 "Frontiers of Information Access Experimentation for Research and Education", which brought together 37 participants from 12 countries.
The seminar addressed technology-enhanced information access (information retrieval, recommender systems, natural language processing) and specifically focused on developing more responsible experimental practices leading to more valid results, both for research and for scientific education.
The seminar brought together experts from various sub-fields of information access, namely IR, RS, NLP, information science, and human-computer interaction, to create a joint understanding of the problems and challenges presented by next generation information access systems, from both the research and the experimentation points of view, to discuss existing solutions and impediments, and to propose next steps to be pursued in the area in order to improve not only our research methods and findings but also the education of the new generation of researchers and developers.
The seminar featured a series of long and short talks delivered by participants, who helped in setting a common ground and in letting topics of interest emerge to be explored as the main output of the seminar. This led to the definition of five groups which investigated challenges, opportunities, and next steps in the following areas: reality check, i.e. conducting real-world studies, human-machine-collaborative relevance judgment frameworks, overcoming methodological challenges in information retrieval and recommender systems through awareness and education, results-blind reviewing, and guidance for authors.
Submitted 18 April, 2023;
originally announced May 2023.
-
Query Performance Prediction for Neural IR: Are We There Yet?
Authors:
Guglielmo Faggioli,
Thibault Formal,
Stefano Marchesin,
Stéphane Clinchant,
Nicola Ferro,
Benjamin Piwowarski
Abstract:
Evaluation in Information Retrieval relies on post-hoc empirical procedures, which are time-consuming and expensive operations. To alleviate this, Query Performance Prediction (QPP) models have been developed to estimate the performance of a system without the need for human-made relevance judgements. Such models, usually relying on lexical features from queries and corpora, have been applied to traditional sparse IR methods - with various degrees of success. With the advent of neural IR and large Pre-trained Language Models, the retrieval paradigm has significantly shifted towards more semantic signals. In this work, we study and analyze to what extent current QPP models can predict the performance of such systems. Our experiments consider seven traditional bag-of-words and seven BERT-based IR approaches, as well as nineteen state-of-the-art QPPs evaluated on two collections, Deep Learning '19 and Robust '04. Our findings show that QPPs perform statistically significantly worse on neural IR systems. In settings where semantic signals are prominent (e.g., passage retrieval), their performance on neural models drops by as much as 10% compared to bag-of-words approaches. On top of that, in lexical-oriented scenarios, QPPs fail to predict performance for neural IR systems on those queries where they differ from traditional approaches the most.
Submitted 20 February, 2023;
originally announced February 2023.
-
Response to Moffat's Comment on "Towards Meaningful Statements in IR Evaluation: Mapping Evaluation Measures to Interval Scales"
Authors:
Marco Ferrante,
Nicola Ferro,
Norbert Fuhr
Abstract:
Moffat recently commented on our previous work. Our work focused on how grounding our evaluation methodology in the theory of measurement can improve our knowledge and understanding of the evaluation measures we use in IR and how it can shed light on the different types of scales adopted by our evaluation measures; we also provided evidence, through extensive experimentation, on the impact of the different types of scales on the statistical analyses, as well as on the impact of departing from their assumptions. Moreover, we investigated, for the first time in IR, the concept of meaningfulness, i.e. the invariance of the experimental statements and inferences you draw, and proposed it as a way to ensure more valid and generalizable results. Moffat's comments (i) build on misconceptions about the representational theory of measurement, such as what an interval scale actually is and what axioms it has to comply with, and (ii) totally miss the central concept of meaningfulness. Therefore, we reply to Moffat's comments by properly framing them in the representational theory of measurement and in the concept of meaningfulness. All in all, we can only reiterate what we said several times: the goal of this research line is to theoretically ground our evaluation methodology - and IR is a field where it is extremely challenging to make any theoretical advances - in order to aim for more robust and generalizable inferences - something we currently lack in the field. Possibly there are other and better ways to achieve this objective and these proposals could emerge from an open discussion in the field and from the work of others. On the other hand, reducing everything to a contrast on what is (or pretends to be) an interval scale or whether all or none of the evaluation measures are interval scales may be more a barrier than a help in progressing towards this goal.
Submitted 22 December, 2022;
originally announced December 2022.
-
FullBrain: a Social E-learning Platform
Authors:
Mirko Biasini,
Vittorio Carmignani,
Nicola Ferro,
Panagiotis Filianos,
Maria Maistro,
Giorgio Maria di Nunzio
Abstract:
We present FullBrain, a social e-learning platform where students share and track their knowledge. FullBrain users can post notes, ask questions and share learning resources in dedicated course and concept spaces. We detail two components of FullBrain: a SIR system equipped with query autocomplete and query autosuggestion, and a Leaderboard module to improve user experience. We analyzed the users' day-to-day usage of the SIR system, measuring a time to complete a request below 0.11 s, matching or exceeding our UX targets. Moreover, we performed stress tests which pave the way for more detailed analysis. Through a preliminary user study and log data analysis, we observe that 97% of the users' activity is directed to the top 4 positions in the leaderboard.
Submitted 1 December, 2022;
originally announced December 2022.
-
Corrected Evaluation Results of the NTCIR WWW-2, WWW-3, and WWW-4 English Subtasks
Authors:
Tetsuya Sakai,
Sijie Tao,
Maria Maistro,
Zhumin Chu,
Yujing Li,
Nuo Chen,
Nicola Ferro,
Junjie Wang,
Ian Soboroff,
Yiqun Liu
Abstract:
Unfortunately, the official English (sub)task results reported in the NTCIR-14 WWW-2, NTCIR-15 WWW-3, and NTCIR-16 WWW-4 overview papers are incorrect due to noise in the official qrels files; this paper reports results based on the corrected qrels files. The noise is due to a fatal bug in the backend of our relevance assessment interface. More specifically, at WWW-2, WWW-3, and WWW-4, two versions of pool files were created for each English topic: a PRI ("prioritised") file, which uses the NTCIRPOOL script to prioritise likely relevant documents, and a RND ("randomised") file, which randomises the pooled documents. This was done for the purpose of studying the effect of document ordering for relevance assessors. However, the programmer who wrote the interface backend assumed that a combination of a topic ID and a document rank in the pool file uniquely determines a document ID; this is obviously incorrect as we have two versions of pool files. The outcome is that all the PRI-based relevance labels for the WWW-2 test collection are incorrect (while all the RND-based relevance labels are correct), and all the RND-based relevance labels for the WWW-3 and WWW-4 test collections are incorrect (while all the PRI-based relevance labels are correct). This bug was finally discovered at the NTCIR-16 WWW-4 task when the first seven authors of this paper served as Gold assessors (i.e., topic creators who define what is relevant) and closely examined the disagreements with Bronze assessors (i.e., non-topic-creators; non-experts). We would like to apologise to the WWW participants and the NTCIR chairs for the inconvenience and confusion caused due to this bug.
Submitted 18 October, 2022;
originally announced October 2022.
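The bug described in the abstract above, assuming that a topic ID plus a rank uniquely identifies a document when two differently ordered pool files exist, can be illustrated with a small sketch. This is not the actual interface code: the pool contents, topic ID, and function names are hypothetical.

```python
# Two hypothetical pool files for the same topic: same documents, different order.
pri_pool = {"0001": ["docA", "docB", "docC"]}  # prioritised order
rnd_pool = {"0001": ["docC", "docA", "docB"]}  # randomised order


def build_lookup(pool):
    """Map (topic_id, rank) to a document ID for one pool file."""
    return {
        (topic, rank): doc
        for topic, docs in pool.items()
        for rank, doc in enumerate(docs, start=1)
    }


def resolve_label(lookup, topic, rank, label):
    """Attach a relevance label to the document found at (topic, rank)."""
    return lookup[(topic, rank)], label


# An assessor working from the RND file judges the document shown at rank 1.
topic, rank, label = "0001", 1, "relevant"

# Correct: resolve the rank against the pool file the assessor actually saw.
correct = resolve_label(build_lookup(rnd_pool), topic, rank, label)
# Buggy: resolve the same (topic, rank) key against the other pool file.
buggy = resolve_label(build_lookup(pri_pool), topic, rank, label)

print(correct, buggy)
```

The same (topic, rank) key resolves to different documents depending on the pool file, so a label stored against that key alone ends up attached to the wrong document for one of the two versions.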
-
A deep learning approach for detection and localization of leaf anomalies
Authors:
Davide Calabrò,
Massimiliano Lupo Pasini,
Nicola Ferro,
Simona Perotto
Abstract:
The detection and localization of possible diseases in crops are usually automated by resorting to supervised deep learning approaches. In this work, we tackle these goals with unsupervised models, by applying three different types of autoencoders to a specific open-source dataset of healthy and unhealthy pepper and cherry leaf images. CAE, CVAE and VQ-VAE autoencoders are deployed to screen unlabeled images of such a dataset, and compared in terms of image reconstruction, anomaly removal, detection and localization. The vector-quantized variational architecture turns out to be the best performing one with respect to all these targets.
Submitted 7 October, 2022;
originally announced October 2022.
-
Enhancing level set-based topology optimization with anisotropic graded meshes
Authors:
Davide Cortellessa,
Nicola Ferro,
Simona Perotto,
Stefano Micheletti
Abstract:
We propose a new algorithm for the design of topologically optimized lightweight structures, under a minimum compliance requirement. The new process enhances a standard level set formulation in terms of computational efficiency, thanks to the employment of a strategic computational mesh. We pursue a twofold goal, i.e., to deliver a final layout characterized by a smooth contour and reliable mechanical properties. The smoothness of the optimized structure is ensured by the employment of an anisotropic adapted mesh, which sharply captures the material/void interface. A robust mechanical performance is guaranteed by a uniform tessellation of the internal part of the optimized configuration. A thorough numerical investigation corroborates the effectiveness of the proposed algorithm as a reliable and computationally affordable design tool, both in two- and three-dimensional contexts.
Submitted 22 August, 2022;
originally announced August 2022.
-
A new fluid-based strategy for the connection of non-matching lattice materials
Authors:
Nicola Ferro,
Simona Perotto,
Matteo Gavazzoni
Abstract:
We present a new algorithm for the design of the connection region between different lattice materials. We solve a Stokes-type topology optimization problem on a narrow morphing region to smoothly connect two different unit cells. The proposed procedure turns out to be effective and provides a local re-design of the materials, leading to a very mild modification of the mechanical behaviour characterizing the original lattices. The robustness of the algorithm is assessed in terms of sensitivity of the final layout to different parameters. Both the cases of Cartesian and non-Cartesian morphing regions are successfully investigated.
Submitted 2 June, 2022;
originally announced June 2022.
-
Towards Feature Selection for Ranking and Classification Exploiting Quantum Annealers
Authors:
Maurizio Ferrari Dacrema,
Fabio Moroni,
Riccardo Nembrini,
Nicola Ferro,
Guglielmo Faggioli,
Paolo Cremonesi
Abstract:
Feature selection is a common step in many ranking, classification, or prediction tasks and serves many purposes. By removing redundant or noisy features, the accuracy of ranking or classification can be improved and the computational cost of the subsequent learning steps can be reduced. However, feature selection can itself be a computationally expensive process. While for decades confined to theoretical algorithmic papers, quantum computing is now becoming a viable tool to tackle realistic problems, in particular special-purpose solvers based on the Quantum Annealing paradigm. This paper aims to explore the feasibility of using currently available quantum computing architectures to solve some quadratic feature selection algorithms for both ranking and classification. The experimental analysis includes 15 state-of-the-art datasets. The effectiveness obtained with quantum computing hardware is comparable to that of classical solvers, indicating that quantum computers are now reliable enough to tackle interesting problems. In terms of scalability, current generation quantum computers are able to provide a limited speedup over certain classical algorithms and hybrid quantum-classical strategies show lower computational cost for problems of more than a thousand features.
Submitted 9 May, 2022;
originally announced May 2022.
-
repro_eval: A Python Interface to Reproducibility Measures of System-oriented IR Experiments
Authors:
Timo Breuer,
Nicola Ferro,
Maria Maistro,
Philipp Schaer
Abstract:
In this work we introduce repro_eval - a tool for reactive reproducibility studies of system-oriented information retrieval (IR) experiments. The corresponding Python package provides IR researchers with measures for different levels of reproduction when evaluating their systems' outputs. By offering an easily extensible interface, we hope to stimulate common practices when conducting a reproducibility study of system-oriented IR experiments.
Submitted 19 January, 2022;
originally announced January 2022.
-
Multi-physics inverse homogenization for the design of innovative cellular materials: application to thermo-mechanical problems
Authors:
Matteo Gavazzoni,
Nicola Ferro,
Simona Perotto,
Stefano Foletti
Abstract:
We present a new algorithm to design lightweight cellular materials with required properties in a multi-physics context. In particular, we focus on a thermo-mechanical setting, by promoting the design of unit cells characterized both by an isotropic and an anisotropic behaviour with respect to mechanical and thermal requirements. The proposed procedure generalizes the microSIMPATY algorithm to a multi-physics framework, by preserving all the good properties of the reference design methodology. The resulting layouts exhibit non-standard topologies and are characterized by very sharp contours, thus limiting the post-processing before manufacturing. The new cellular materials are compared with the state of the art in engineering practice in terms of thermo-mechanical properties, thus highlighting the good performance of the new layouts which, in some cases, outperform the consolidated choices.
Submitted 4 January, 2022;
originally announced January 2022.
-
Towards Meaningful Statements in IR Evaluation. Mapping Evaluation Measures to Interval Scales
Authors:
Marco Ferrante,
Nicola Ferro,
Norbert Fuhr
Abstract:
Recently, it was shown that most popular IR measures are not interval-scaled, implying that decades of experimental IR research used potentially improper methods, which may have produced questionable results. However, it was unclear if and to what extent these findings apply to actual evaluations, and this opened a debate in the community, with researchers taking opposite positions about whether this should be considered an issue (or not) and to what extent.
In this paper, we first give an introduction to the representational measurement theory explaining why certain operations and significance tests are permissible only with scales of a certain level. For that, we introduce the notion of meaningfulness specifying the conditions under which the truth (or falsity) of a statement is invariant under permissible transformations of a scale. Furthermore, we show how the recall base and the length of the run may make comparison and aggregation across topics problematic.
Then we propose a straightforward and powerful approach for turning an evaluation measure into an interval scale, and describe an experimental evaluation of the differences between using the original measures and the interval-scaled ones.
For all the measures considered - namely Precision, Recall, Average Precision, (Normalized) Discounted Cumulative Gain, Rank-Biased Precision and Reciprocal Rank - we observe substantial effects, both on the order of average values and on the outcome of significance tests. For the latter, previously significant differences turn out to be insignificant, while insignificant ones become significant. The effect varies remarkably between the tests considered but overall, on average, we observed a 25% change in the decision about which systems are significantly different and which are not.
Submitted 7 January, 2021;
originally announced January 2021.
-
How to Measure the Reproducibility of System-oriented IR Experiments
Authors:
Timo Breuer,
Nicola Ferro,
Norbert Fuhr,
Maria Maistro,
Tetsuya Sakai,
Philipp Schaer,
Ian Soboroff
Abstract:
Replicability and reproducibility of experimental results are primary concerns in all the areas of science and IR is not an exception. Besides the problem of moving the field towards more reproducible experimental practices and protocols, we also face a severe methodological issue: we do not have any means to assess when reproduced is reproduced. Moreover, we lack any reproducibility-oriented dataset, which would allow us to develop such methods. To address these issues, we compare several measures to objectively quantify to what extent we have replicated or reproduced a system-oriented IR experiment. These measures operate at different levels of granularity, from the fine-grained comparison of ranked lists, to the more general comparison of the obtained effects and significant differences. Moreover, we also develop a reproducibility-oriented dataset, which allows us to validate our measures and which can also be used to develop future measures.
Submitted 26 October, 2020;
originally announced October 2020.
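The abstract above describes measures at different levels of granularity, from comparing ranked lists to comparing obtained effects. As a rough illustration of two such levels, the sketch below computes Kendall's tau between an original and a reproduced ranking, and the root mean square error between per-topic scores. The choice of these two measures, the function names, and the toy data are assumptions made here for illustration; the abstract does not specify which measures the paper compares.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau-a between two rankings of the same items (no ties)."""
    position_b = {doc: i for i, doc in enumerate(ranking_b)}
    concordant = discordant = 0
    # combinations() yields pairs (x, y) where x precedes y in ranking_a.
    for x, y in combinations(ranking_a, 2):
        if position_b[x] < position_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def rmse(scores_a, scores_b):
    """Root mean square error between two lists of per-topic scores."""
    squared = [(a - b) ** 2 for a, b in zip(scores_a, scores_b)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical original and reproduced runs for one topic.
original_ranking = ["d1", "d2", "d3"]
reproduced_ranking = ["d1", "d3", "d2"]
tau = kendall_tau(original_ranking, reproduced_ranking)

# Hypothetical per-topic effectiveness scores of the two runs.
error = rmse([1.0, 0.0], [0.0, 0.0])

print(tau, error)
```

A tau of 1 would indicate identical orderings and an error of 0 identical per-topic scores; intermediate values quantify how far the reproduction departs from the original at each level.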