Search | arXiv e-print repository

doi 10.1145/3626772.3657861

Systematic Evaluation of Neural Retrieval Models on the Touché 2020 Argument Retrieval Subset of BEIR

Authors: Nandan Thakur, Luiz Bonifacio, Maik Fröbe, Alexander Bondarenko, Ehsan Kamalloo, Martin Potthast, Matthias Hagen, Jimmy Lin

Abstract: The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark -- a combination of different IR evaluation datasets. Interestingly, previous studies found that particularly on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. Still, so far, no further investigation has been conducted on what… ▽ More The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark -- a combination of different IR evaluation datasets. Interestingly, previous studies found that particularly on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. Still, so far, no further investigation has been conducted on what makes argument retrieval so "special". To more deeply analyze the respective potential limits of neural retrieval models, we run a reproducibility study on the Touché 2020 data. In our study, we focus on two experiments: (i) a black-box evaluation (i.e., no model retraining), incorporating a theoretical exploration using retrieval axioms, and (ii) a data denoising evaluation involving post-hoc relevance judgments. Our black-box evaluation reveals an inherent bias of neural models towards retrieving short passages from the Touché 2020 data, and we also find that quite a few of the neural models' results are unjudged in the Touché 2020 data. As many of the short Touché passages are not argumentative and thus non-relevant per se, and as the missing judgments complicate fair comparison, we denoise the Touché 2020 data by excluding very short passages (less than 20 words) and by augmenting the unjudged data with post-hoc judgments following the Touché guidelines. On the denoised data, the effectiveness of the neural models improves by up to 0.52 in nDCG@10, but BM25 is still more effective. Our code and the augmented Touché 2020 dataset are available at \url{https://github.com/castorini/touche-error-analysis}. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: SIGIR 2024 (Resource & Reproducibility Track)

arXiv:2312.11361 [pdf, other]

"Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation

Authors: Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi Rezagholizadeh, Jimmy Lin

Abstract: Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. However, prior work lacks a comprehensive evaluation of different language families, making it challenging to evaluate LLM robustness against errors in external retrieved knowledge. To overcome this, we establish NoMIRACL, a human-annotated datas… ▽ More Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. However, prior work lacks a comprehensive evaluation of different language families, making it challenging to evaluate LLM robustness against errors in external retrieved knowledge. To overcome this, we establish NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. NoMIRACL includes both a non-relevant and a relevant subset. Queries in the non-relevant subset contain passages judged as non-relevant, whereas queries in the relevant subset include at least a single judged relevant passage. We measure relevance assessment using: (i) hallucination rate, measuring model tendency to hallucinate, when the answer is not present in passages in the non-relevant subset, and (ii) error rate, measuring model inaccuracy to recognize relevant passages in the relevant subset.In our work, we observe that most models struggle to balance the two capacities. Models such as LLAMA-2 and Orca-2 achieve over 88% hallucination rate on the non-relevant subset. Mistral and LLAMA-3 hallucinate less but can achieve up to a 74.9% error rate on the relevant subset. Overall, GPT-4 is observed to provide the best tradeoff on both subsets, highlighting future work necessary to improve LLM robustness. NoMIRACL dataset and evaluation code are available at: https://github.com/project-miracl/nomiracl. △ Less

Submitted 10 November, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

Comments: EMNLP 2024 (Findings)

arXiv:2308.07177 [pdf, ps, other]

Conformance Checking for Pushdown Reactive Systems based on Visibly Pushdown Languages

Authors: Adilson Luiz Bonifacio

Abstract: Testing pushdown reactive systems is deemed important to guarantee a precise and robust software development process. Usually, such systems can be specified by the formalism of Input/Output Visibly Pushdown Labeled Transition System (IOVPTS), where the interaction with the environment is regulated by a pushdown memory. Hence a conformance checking can be applied in a testing process to verify whet… ▽ More Testing pushdown reactive systems is deemed important to guarantee a precise and robust software development process. Usually, such systems can be specified by the formalism of Input/Output Visibly Pushdown Labeled Transition System (IOVPTS), where the interaction with the environment is regulated by a pushdown memory. Hence a conformance checking can be applied in a testing process to verify whether an implementation is in compliance to a specification using an appropriate conformance relation. In this work we establish a novelty conformance relation based on Visibly Pushdown Languages (VPLs) that can model sets of desirable and undesirable behaviors of systems. Further, we show that test suites with a complete fault coverage can be generated using this conformance relation for pushdown reactive systems. △ Less

Submitted 14 August, 2023; originally announced August 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2107.11421

arXiv:2307.04601 [pdf, ps, other]

InPars Toolkit: A Unified and Reproducible Synthetic Data Generation Pipeline for Neural Information Retrieval

Authors: Hugo Abonizio, Luiz Bonifacio, Vitor Jeronymo, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

Abstract: Recent work has explored Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models have enabled the creation of synthetic in-domain data by providing instructions and a few examples on a prompt. InPars and Promptagator have pioneered this approach and both methods have demonstrated the potential of using LL… ▽ More Recent work has explored Large Language Models (LLMs) to overcome the lack of training data for Information Retrieval (IR) tasks. The generalization abilities of these models have enabled the creation of synthetic in-domain data by providing instructions and a few examples on a prompt. InPars and Promptagator have pioneered this approach and both methods have demonstrated the potential of using LLMs as synthetic data generators for IR tasks. This makes them an attractive solution for IR tasks that suffer from a lack of annotated data. However, the reproducibility of these methods was limited, because InPars' training scripts are based on TPUs -- which are not widely accessible -- and because the code for Promptagator was not released and its proprietary LLM is not publicly accessible. To fully realize the potential of these methods and make their impact more widespread in the research community, the resources need to be accessible and easy to reproduce by researchers and practitioners. Our main contribution is a unified toolkit for end-to-end reproducible synthetic data generation research, which includes generation, filtering, training and evaluation. Additionally, we provide an interface to IR libraries widely used by the community and support for GPU. Our toolkit not only reproduces the InPars method and partially reproduces Promptagator, but also provides a plug-and-play functionality allowing the use of different LLMs, exploring filtering methods and finetuning various reranker models on the generated data. We also made available all the synthetic data generated in this work for the 18 different datasets in the BEIR benchmark which took more than 2,000 GPU hours to be generated as well as the reranker models finetuned on the synthetic data. Code and data are available at https://github.com/zetaalphavector/InPars △ Less

Submitted 10 July, 2023; originally announced July 2023.

arXiv:2301.01820 [pdf, ps, other]

InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval

Authors: Vitor Jeronymo, Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Jakub Zavrel, Rodrigo Nogueira

Abstract: Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such dataset… ▽ More Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tpu △ Less

Submitted 26 May, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

arXiv:2212.06121 [pdf, other]

In Defense of Cross-Encoders for Zero-Shot Retrieval

Authors: Guilherme Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

Abstract: Bi-encoders and cross-encoders are widely used in many state-of-the-art retrieval pipelines. In this work we study the generalization ability of these two types of architectures on a wide range of parameter count on both in-domain and out-of-domain scenarios. We find that the number of parameters and early query-document interactions of cross-encoders play a significant role in the generalization… ▽ More Bi-encoders and cross-encoders are widely used in many state-of-the-art retrieval pipelines. In this work we study the generalization ability of these two types of architectures on a wide range of parameter count on both in-domain and out-of-domain scenarios. We find that the number of parameters and early query-document interactions of cross-encoders play a significant role in the generalization ability of retrieval models. Our experiments show that increasing model size results in marginal gains on in-domain test sets, but much larger gains in new domains never seen during fine-tuning. Furthermore, we show that cross-encoders largely outperform bi-encoders of similar size in several tasks. In the BEIR benchmark, our largest cross-encoder surpasses a state-of-the-art bi-encoder by more than 4 average points. Finally, we show that using bi-encoders as first-stage retrievers provides no gains in comparison to a simpler retriever such as BM25 on out-of-domain tasks. The code is available at https://github.com/guilhermemr04/scaling-zero-shot-retrieval.git △ Less

Submitted 12 December, 2022; originally announced December 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2206.02873

arXiv:2210.14837 [pdf, other]

NeuralSearchX: Serving a Multi-billion-parameter Reranker for Multilingual Metasearch at a Low Cost

Authors: Thales Sales Almeida, Thiago Laitz, João Seródio, Luiz Henrique Bonifacio, Roberto Lotufo, Rodrigo Nogueira

Abstract: The widespread availability of search API's (both free and commercial) brings the promise of increased coverage and quality of search results for metasearch engines, while decreasing the maintenance costs of the crawling and indexing infrastructures. However, merging strategies frequently comprise complex pipelines that require careful tuning, which is often overlooked in the literature. In this w… ▽ More The widespread availability of search API's (both free and commercial) brings the promise of increased coverage and quality of search results for metasearch engines, while decreasing the maintenance costs of the crawling and indexing infrastructures. However, merging strategies frequently comprise complex pipelines that require careful tuning, which is often overlooked in the literature. In this work, we describe NeuralSearchX, a metasearch engine based on a multi-purpose large reranking model to merge results and highlight sentences. Due to the homogeneity of our architecture, we could focus our optimization efforts on a single component. We compare our system with Microsoft's Biomedical Search and show that our design choices led to a much cost-effective system with competitive QPS while having close to state-of-the-art results on a wide range of public benchmarks. Human evaluation on two domain-specific tasks shows that our retrieval system outperformed Google API by a large margin in terms of nDCG@10 scores. By describing our architecture and implementation in detail, we hope that the community will build on our design choices. The system is available at https://neuralsearchx.nsx.ai. △ Less

Submitted 26 October, 2022; originally announced October 2022.

Comments: published as a full paper at the DESIRES 2022 Conference. 13 pages

Journal ref: DESIRES 2022-3rd International Conference on Design of Experimental Search and Information REtrieval Systems, 30-31,August 2022, San Jose, CA, USA

arXiv:2206.02873 [pdf, other]

No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval

Authors: Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

Abstract: Recent work has shown that small distilled language models are strong competitors to models that are orders of magnitude larger and slower in a wide range of information retrieval tasks. This has made distilled and dense models, due to latency constraints, the go-to choice for deployment in real-world retrieval applications. In this work, we question this practice by showing that the number of par… ▽ More Recent work has shown that small distilled language models are strong competitors to models that are orders of magnitude larger and slower in a wide range of information retrieval tasks. This has made distilled and dense models, due to latency constraints, the go-to choice for deployment in real-world retrieval applications. In this work, we question this practice by showing that the number of parameters and early query-document interaction play a significant role in the generalization ability of retrieval models. Our experiments show that increasing model size results in marginal gains on in-domain test sets, but much larger gains in new domains never seen during fine-tuning. Furthermore, we show that rerankers largely outperform dense ones of similar size in several tasks. Our largest reranker reaches the state of the art in 12 of the 18 datasets of the Benchmark-IR (BEIR) and surpasses the previous state of the art by 3 average points. Finally, we confirm that in-domain effectiveness is not a good indicator of zero-shot effectiveness. Code is available at https://github.com/guilhermemr04/scaling-zero-shot-retrieval.git △ Less

Submitted 12 December, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

arXiv:2205.15172 [pdf, ps, other]

Billions of Parameters Are Worth More Than In-domain Training Data: A case study in the Legal Case Entailment Task

Authors: Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Roberto Lotufo, Rodrigo Nogueira

Abstract: Recent work has shown that language models scaled to billions of parameters, such as GPT-3, perform remarkably well in zero-shot and few-shot scenarios. In this work, we experiment with zero-shot models in the legal case entailment task of the COLIEE 2022 competition. Our experiments show that scaling the number of parameters in a language model improves the F1 score of our previous zero-shot resu… ▽ More Recent work has shown that language models scaled to billions of parameters, such as GPT-3, perform remarkably well in zero-shot and few-shot scenarios. In this work, we experiment with zero-shot models in the legal case entailment task of the COLIEE 2022 competition. Our experiments show that scaling the number of parameters in a language model improves the F1 score of our previous zero-shot result by more than 6 points, suggesting that stronger zero-shot capability may be a characteristic of larger models, at least for this task. Our 3B-parameter zero-shot model outperforms all models, including ensembles, in the COLIEE 2021 test set and also achieves the best performance of a single model in the COLIEE 2022 competition, second only to the ensemble composed of the 3B model itself and a smaller version of the same model. Despite the challenges posed by large language models, mainly due to latency constraints in real-time applications, we provide a demonstration of our zero-shot monoT5-3b model being used in production as a search engine, including for legal documents. The code for our submission and the demo of our system are available at https://github.com/neuralmind-ai/coliee and https://neuralsearchx.neuralmind.ai, respectively. △ Less

Submitted 30 May, 2022; originally announced May 2022.

arXiv:2202.05144 [pdf, other]

InPars: Data Augmentation for Information Retrieval using Large Language Models

Authors: Luiz Bonifacio, Hugo Abonizio, Marzieh Fadaee, Rodrigo Nogueira

Abstract: The information retrieval community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity has enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit from one single dataset equally. Extensive research in various NLP tasks has show… ▽ More The information retrieval community has recently witnessed a revolution due to large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity has enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit from one single dataset equally. Extensive research in various NLP tasks has shown that using domain-specific training data, as opposed to a general-purpose one, improves the performance of neural models. In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks. We show that models finetuned solely on our unsupervised dataset outperform strong baselines such as BM25 as well as recently proposed self-supervised dense retrieval methods. Furthermore, retrievers finetuned on both supervised and our synthetic data achieve better zero-shot transfer than models finetuned only on supervised data. Code, models, and data are available at https://github.com/zetaalphavector/inpars . △ Less

Submitted 10 February, 2022; originally announced February 2022.

arXiv:2108.13897 [pdf, other]

mMARCO: A Multilingual Version of the MS MARCO Passage Ranking Dataset

Authors: Luiz Bonifacio, Vitor Jeronymo, Hugo Queiroz Abonizio, Israel Campiotti, Marzieh Fadaee, Roberto Lotufo, Rodrigo Nogueira

Abstract: The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness on diverse zero-shot scenarios. However, this type of resource is scarce in languages other than English. In this work, we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 13 languages that was created using machine translat… ▽ More The MS MARCO ranking dataset has been widely used for training deep learning models for IR tasks, achieving considerable effectiveness on diverse zero-shot scenarios. However, this type of resource is scarce in languages other than English. In this work, we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 13 languages that was created using machine translation. We evaluated mMARCO by finetuning monolingual and multilingual reranking models, as well as a multilingual dense retrieval model on this dataset. We also evaluated models finetuned using the mMARCO dataset in a zero-shot scenario on Mr. TyDi dataset, demonstrating that multilingual models finetuned on our translated dataset achieve superior effectiveness to models finetuned on the original English version alone. Our experiments also show that a distilled multilingual reranker is competitive with non-distilled models while having 5.4 times fewer parameters. Lastly, we show a positive correlation between translation quality and retrieval effectiveness, providing evidence that improvements in translation methods might lead to improvements in multilingual information retrieval. The translated datasets and finetuned models are available at https://github.com/unicamp-dl/mMARCO. △ Less

Submitted 17 August, 2022; v1 submitted 31 August, 2021; originally announced August 2021.

arXiv:2107.11421 [pdf, ps, other]

Testing Pushdown Systems

Authors: Adilson Luiz Bonifacio, Arnaldo Vieira Moura

Abstract: Testing on reactive systems is a well-known laborious activity on software development due to their asynchronous interaction with the environment. In this setting model based testing has been employed when checking conformance and generating test suites of such systems using labeled transition system as a formalism as well as the classical ioco conformance relation. In this work we turn to a more… ▽ More Testing on reactive systems is a well-known laborious activity on software development due to their asynchronous interaction with the environment. In this setting model based testing has been employed when checking conformance and generating test suites of such systems using labeled transition system as a formalism as well as the classical ioco conformance relation. In this work we turn to a more complex scenario where the target systems have an auxiliary memory, a stack. We then studied a more powerful model, the Visibly Pushdown Labeled Transition System (VPTS), its variant Input/Output VPTS (IOVPTS), its associated model Visibly Pushdown Automaton (VPA), and aspects of conformance testing and test suite generation. This scenario is much more challenge since the base model has a pushdown stack to capture more complex behaviors which commonly found on reactive systems. We then defined a more general conformance relation for pushdown reactive systems such that it prevents any observable implementation behavior that was not already present in the given specification. Further we gave an efficient algorithm to check conformance in this testing scenario and also showed that it runs in worst case asymptotic polynomial time in the size of both the given specification and the implementation that are put under test. △ Less

Submitted 23 July, 2021; originally announced July 2021.

arXiv:2105.06813 [pdf, other]

A cost-benefit analysis of cross-lingual transfer methods

Authors: Guilherme Moraes Rosa, Luiz Henrique Bonifacio, Leandro Rodrigues de Souza, Roberto Lotufo, Rodrigo Nogueira

Abstract: An effective method for cross-lingual transfer is to fine-tune a bilingual or multilingual model on a supervised dataset in one language and evaluating it on another language in a zero-shot manner. Translating examples at training time or inference time are also viable alternatives. However, there are costs associated with these methods that are rarely addressed in the literature. In this work, we… ▽ More An effective method for cross-lingual transfer is to fine-tune a bilingual or multilingual model on a supervised dataset in one language and evaluating it on another language in a zero-shot manner. Translating examples at training time or inference time are also viable alternatives. However, there are costs associated with these methods that are rarely addressed in the literature. In this work, we analyze cross-lingual methods in terms of their effectiveness (e.g., accuracy), development and deployment costs, as well as their latencies at inference time. Our experiments on three tasks indicate that the best cross-lingual method is highly task-dependent. Finally, by combining zero-shot and translation methods, we achieve the state-of-the-art in two of the three datasets used in this work. Based on these results, we question the need for manually labeled training data in a target language. Code and translated datasets are available at https://github.com/unicamp-dl/cross-lingual-analysis △ Less

Submitted 14 December, 2021; v1 submitted 14 May, 2021; originally announced May 2021.

arXiv:2011.00389 [pdf, other]

A Model-Based Testing Tool for Asynchronous Reactive Systems

Authors: Adilson Luiz Bonifacio, Camila Sonoda Gomes

Abstract: Reactive systems are characterized by the interaction with the environment, where the exchange of the input and output stimuli, usually, occurs asynchronously. Systems of this nature, in general, require a rigorous testing activity over their developing process. Therefore model-based testing has been successfully applied over asynchronous reactive systems using Input Output Labeled Transition Syst… ▽ More Reactive systems are characterized by the interaction with the environment, where the exchange of the input and output stimuli, usually, occurs asynchronously. Systems of this nature, in general, require a rigorous testing activity over their developing process. Therefore model-based testing has been successfully applied over asynchronous reactive systems using Input Output Labeled Transition Systems (IOLTSs) as the basis. In this work we present a reactive testing tool to check conformance, generate test suites and run test cases using IOLTS models. Our tool can check whether the behavior of an implementation under test (IUT) complies with the behavior of its respective specification. We have implemented a classical conformance relation \ioco and a more general notion of conformance based on regular languages. Further, the tool provides a test suite generation in a black-box testing setting for finding faults over IUTs according to a specific domain. We have also described some case studies to probe the tool's functionalities and also to highlight a comparative analysis on both conformance approaches. Finally, we offer experiments to evaluate the performance of our tool using several scenarios. △ Less

Submitted 31 October, 2020; originally announced November 2020.

Comments: 19 pages, 14 figures, 2 tables

arXiv:1905.08914 [pdf, other]

Automatically Checking Conformance on Asynchronous Reactive Systems

Authors: Camila Sonada Gomes, Adilson Luiz Bonifacio

Abstract: Software testing is an important issue in software development process to ensure higher quality on the products. Formal methods has been promising on testing reactive systems, specially critical systems, where accuracy is mandatory since any fault can cause severe damage. Systems of this nature are characterized by receiving messages from the environment and producing outputs in response. One of t… ▽ More Software testing is an important issue in software development process to ensure higher quality on the products. Formal methods has been promising on testing reactive systems, specially critical systems, where accuracy is mandatory since any fault can cause severe damage. Systems of this nature are characterized by receiving messages from the environment and producing outputs in response. One of the most challenges in model-based testing is the conformance checking of asynchronous reactive systems. The aim is to verify if an implementation is in compliance with its respective specification. In this work, we develop a practical tool to check conformance relation between reactive models using a more general theory based on regular languages. The approach, in fact, subsumes the classical \ioco conformance which is also available in our tool. In addition, we present some studies with different sceneries that are applied to practical tools with both notions of conformance. △ Less

Submitted 10 August, 2019; v1 submitted 21 May, 2019; originally announced May 2019.

Comments: 18 pages, 9 figures, 5 algorithms

MSC Class: 68N30

arXiv:1902.10278 [pdf, ps, other]

A conformance relation and complete test suites for I/O systems

Authors: Adilson Luiz Bonifacio, Arnaldo Vieira Moura

Abstract: Model based testing is a well-established approach to verify implementations modeled by I/O labeled transition systems (IOLTSs). One of the challenges stemming from model based testing is the conformance checking and the generation of test suites, specially when completeness is a required property. In order to check whether an implementation under test is in compliance with its respective specific… ▽ More Model based testing is a well-established approach to verify implementations modeled by I/O labeled transition systems (IOLTSs). One of the challenges stemming from model based testing is the conformance checking and the generation of test suites, specially when completeness is a required property. In order to check whether an implementation under test is in compliance with its respective specification one resorts to some form of conformance relation that guarantees the expected behavior of the implementations, given the behavior of the specification. The ioco conformance relation is an example of such a relation, specially suited for asynchronous models. In this work we study a more general conformance relation, show how to generate finite and complete test suites, and discuss the complexity of the test generation mechanism under this more general conformance relation. We also show that ioco conformance is a special case of this new conformance relation, and we investigate the complexity of classical ioco-complete test suites. Further, we relate our contributions to more recent works, accommodating the restrictions of their classes of fault models within our more general approach as special cases, and expose the complexity of generating any complete test suite that must satisfy their restrictions. △ Less

Submitted 10 February, 2020; v1 submitted 7 February, 2019; originally announced February 2019.

Comments: 44 pages, 20 figures

arXiv:1809.01103 [pdf, other]

An automatic tool for checking multi-party contracts

Authors: Adilson Luiz Bonifacio, Wellington Aparecido Della Mura

Abstract: Contracts play an important role in business where relationships among different parties are dictated by legal rules. The notion of electronic contracts has emerged mostly due to technological advances and the electronic trading among companies and customers. Thereby new challenges have arisen to guarantee reliability among the stakeholders in electronic negotiations. In this scenery, the automati… ▽ More Contracts play an important role in business where relationships among different parties are dictated by legal rules. The notion of electronic contracts has emerged mostly due to technological advances and the electronic trading among companies and customers. Thereby new challenges have arisen to guarantee reliability among the stakeholders in electronic negotiations. In this scenery, the automatic verification of electronic contracts appeared as the solution but as a new challenge at the same time. An important task on verifying contracts is concerned of detecting conflicts in multi-party contracts. The problem of checking contracts has been largely addressed in the literature, but we are not aware about any method and tool that deals with multi-party contracts and conflict detection using a contract language. This work presents an automatic checker, so-called RECALL, for finding conflicts on multi-party contracts modeled by an extension of a contract language. We developed an automatic checking tool and also applied it to a a well-known case study of selling products that is characterized by multi-party aspects of the contracts. We also performed some experiments in order to show the tool performance w.r.t. the size of contracts. △ Less

Submitted 4 September, 2018; originally announced September 2018.

Comments: 28 pages, 26 figures, 3 tables, 2 algorithms

arXiv:1508.02767 [pdf, ps, other]

Intrinsic Properties of Complete Test Suites

Authors: Adilson Luiz Bonifacio, Arnaldo Vieira Moura

Abstract: Completeness is a desirable property of test suites. Roughly, completeness guarantees that a non-equivalent implementation under test will always be identified. Several approaches proposed sufficient, and sometimes also necessary, conditions on the specification model and on the test suite in order to guarantee completeness. Usually, these approaches impose several restrictions on the specificatio… ▽ More Completeness is a desirable property of test suites. Roughly, completeness guarantees that a non-equivalent implementation under test will always be identified. Several approaches proposed sufficient, and sometimes also necessary, conditions on the specification model and on the test suite in order to guarantee completeness. Usually, these approaches impose several restrictions on the specification and on the implementations, such as requiring them to be reduced or complete. Further, test cases are required to be non-blocking --- that is, they must run to completion --- on both the specification and the implementation models. In this work we deal test cases that can be blocking, we define a new notion that captures completeness, and we characterize test suite completeness in this new scenario. We establish an upper bound on the number of states of implementations beyond which no test suite can be complete, both in the classical sense and in the new scenario with blocking test cases. △ Less

Submitted 11 August, 2015; originally announced August 2015.

Showing 1–18 of 18 results for author: Bonifacio, L