-
$β^{4}$-IRT: A New $β^{3}$-IRT with Enhanced Discrimination Estimation
Authors:
Manuel Ferreira-Junior,
Jessica T. S. Reinaldo,
Telmo M. Silva Filho,
Eufrasio A. Lima Neto,
Ricardo B. C. Prudencio
Abstract:
Item response theory aims to estimate respondents' latent skills from their responses in tests composed of items with different levels of difficulty. Several models of item response theory have been proposed for different types of tasks, such as binary or probabilistic responses, response time, multiple responses, among others. In this paper, we propose a new version of $β^3$-IRT, called $β^{4}$-IRT, which uses the gradient descent method to estimate the model parameters. In $β^3$-IRT, abilities and difficulties are bounded, thus we employ link functions in order to turn $β^{4}$-IRT into an unconstrained gradient descent process. The original $β^3$-IRT had a symmetry problem, meaning that, if an item was initialised with a discrimination value with the wrong sign, e.g. negative when the actual discrimination should be positive, the fitting process could be unable to recover the correct discrimination and difficulty values for the item. In order to tackle this limitation, we modelled the discrimination parameter as the product of two new parameters, one corresponding to the sign and the second associated with the magnitude. We also proposed sensible priors for all parameters. We performed experiments to compare $β^{4}$-IRT and $β^3$-IRT regarding parameter recovery and our new version outperformed the original $β^3$-IRT. Finally, we made $β^{4}$-IRT publicly available as a Python package, along with the implementation of $β^3$-IRT used in our experiments.
Submitted 30 March, 2023;
originally announced March 2023.
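The reparameterisation described in the $β^{4}$-IRT abstract above can be sketched as follows. This is a minimal illustration and not the authors' package: the specific link functions (logistic sigmoid for the bounded abilities and difficulties, tanh for the sign, softplus for the magnitude) and all function names are assumptions, since the abstract does not name them.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function, maps any real to (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x: float) -> float:
    """Numerically stable softplus, maps any real to (0, +inf)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def to_model_params(t: float, d: float, s: float, m: float):
    """Map unconstrained reals to bounded model parameters.

    t, d, s, m are free parameters that gradient descent could update
    without any box constraints (hypothetical names).
    """
    ability = sigmoid(t)          # assumed link: ability in (0, 1)
    difficulty = sigmoid(d)       # assumed link: difficulty in (0, 1)
    sign = math.tanh(s)           # assumed link: sign factor in (-1, 1)
    magnitude = softplus(m)       # assumed link: magnitude > 0
    discrimination = sign * magnitude  # product of sign and magnitude
    return ability, difficulty, discrimination
```

Under these assumptions, flipping the sign of a discrimination only requires the sign factor to cross zero, while the magnitude is learned separately.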
-
Explanation-by-Example Based on Item Response Theory
Authors:
Lucas F. F. Cardoso,
José de S. Ribeiro,
Vitor C. A. Santos,
Raíssa L. Silva,
Marcelle P. Mota,
Ricardo B. C. Prudêncio,
Ronnie C. O. Alves
Abstract:
Intelligent systems that use Machine Learning classification algorithms are increasingly common in everyday society. However, many systems use black-box models that do not have characteristics that allow for self-explanation of their predictions. This situation leads researchers in the field and society to the following question: How can I trust the prediction of a model I cannot understand? In this sense, XAI emerges as a field of AI that aims to create techniques capable of explaining the decisions of the classifier to the end-user. As a result, several techniques have emerged, such as Explanation-by-Example, which has a few initiatives consolidated by the community currently working with XAI. This research explores Item Response Theory (IRT) as a tool for explaining the models and measuring the level of reliability of the Explanation-by-Example approach. To this end, four datasets with different levels of complexity were used, and the Random Forest model was used as a hypothesis test. From the test set, 83.8% of the errors are from instances in which the IRT points out the model as unreliable.
Submitted 4 October, 2022;
originally announced October 2022.
-
Label noise detection under the Noise at Random model with ensemble filters
Authors:
Kecia G. Moura,
Ricardo B. C. Prudêncio,
George D. C. Cavalcanti
Abstract:
Label noise detection has been widely studied in Machine Learning because of its importance in improving training data quality. Satisfactory noise detection has been achieved by adopting ensembles of classifiers. In this approach, an instance is flagged as mislabeled if a high proportion of members in the pool misclassifies it. Previous authors have empirically evaluated this approach; nevertheless, they mostly assumed that label noise is generated completely at random in a dataset. This is a strong assumption since other types of label noise are feasible in practice and can influence noise detection results. This work investigates the performance of ensemble noise detection under two different noise models: the Noisy at Random (NAR) model, in which the probability of label noise depends on the instance class, and the Noisy Completely at Random model, in which the probability of label noise is entirely independent. In this setting, we investigate the effect of class distribution on noise detection performance since it changes the total noise level observed in a dataset under the NAR assumption. Further, an evaluation of the ensemble vote threshold is conducted to contrast with the most common approaches in the literature. In many of the experiments performed, choosing one noise generation model over another can lead to different results when considering aspects such as class imbalance and noise level ratio among different classes.
Submitted 2 December, 2021;
originally announced December 2021.
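The ensemble filter described in the label noise abstract above flags an instance when a high proportion of pool members misclassifies it. A minimal sketch of that voting rule follows; the function name, the strict "greater than" comparison and the default threshold of 0.5 (a majority vote) are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def flag_noisy(
    predictions: Sequence[Sequence[int]],
    labels: Sequence[int],
    threshold: float = 0.5,
) -> List[int]:
    """Return indices of instances flagged as possibly mislabeled.

    predictions[k][i] is the label predicted by ensemble member k for
    instance i; labels[i] is the (possibly noisy) observed label.
    An instance is flagged when the fraction of members that disagree
    with its observed label exceeds `threshold`.
    """
    n_members = len(predictions)
    flagged = []
    for i, observed in enumerate(labels):
        wrong = sum(1 for member in predictions if member[i] != observed)
        if wrong / n_members > threshold:
            flagged.append(i)
    return flagged


# Three hypothetical ensemble members, three instances.
votes = [
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 0],
]
observed_labels = [0, 1, 1]
print(flag_noisy(votes, observed_labels))
```

Raising the threshold towards 1.0 approximates a consensus vote, which flags fewer instances; lowering it flags more.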
-
Data vs classifiers, who wins?
Authors:
Lucas F. F. Cardoso,
Vitor C. A. Santos,
Regiane S. Kawasaki Francês,
Ricardo B. C. Prudêncio,
Ronnie C. O. Alves
Abstract:
The experiments covered by Machine Learning (ML) must consider two important aspects to assess the performance of a model: datasets and algorithms. Robust benchmarks are needed to evaluate the best classifiers. For this, one can adopt gold standard benchmarks available in public repositories. However, it is common not to consider the complexity of the dataset when evaluating. This work proposes a new assessment methodology based on the combination of Item Response Theory (IRT) and Glicko-2, a rating system mechanism generally adopted to assess the strength of players (e.g., chess). For each dataset in a benchmark, the IRT is used to estimate the ability of classifiers, where good classifiers have good predictions for the most difficult test instances. Tournaments are then run for each pair of classifiers so that Glicko-2 updates performance information such as rating value, rating deviation and volatility for each classifier. A case study was conducted that adopted the OpenML-CC18 benchmark as the collection of datasets and a pool of various classification algorithms for evaluation. Not all datasets were found to be really useful for evaluating algorithms: only 10% were considered really difficult. Furthermore, a subset containing only 50% of the original OpenML-CC18 datasets was found to be equally useful for algorithm evaluation. Regarding the algorithms, the methodology proposed herein identified the Random Forest as the algorithm with the best innate ability.
Submitted 1 November, 2021; v1 submitted 15 July, 2021;
originally announced July 2021.
-
Decoding machine learning benchmarks
Authors:
Lucas F. F. Cardoso,
Vitor C. A. Santos,
Regiane S. K. Francês,
Ricardo B. C. Prudêncio,
Ronnie C. O. Alves
Abstract:
Despite the availability of benchmark machine learning (ML) repositories (e.g., UCI, OpenML), there is no standard evaluation strategy yet capable of pointing out which is the best set of datasets to serve as gold standard to test different ML algorithms. In recent studies, Item Response Theory (IRT) has emerged as a new approach to elucidate what should be a good ML benchmark. This work applied IRT to explore the well-known OpenML-CC18 benchmark to identify how suitable it is for the evaluation of classifiers. Several classifiers ranging from classical to ensemble ones were evaluated using IRT models, which could simultaneously estimate dataset difficulty and classifiers' ability. The Glicko-2 rating system was applied on top of IRT to summarize the innate ability and aptitude of classifiers. It was observed that not all datasets from OpenML-CC18 are really useful to evaluate classifiers. Most datasets evaluated in this work (84%) contain easy instances in general (e.g., around 10% of difficult instances only). Also, 80% of the instances in half of this benchmark are very discriminating ones, which can be of great use for pairwise algorithm comparison, but not useful to push classifiers' abilities. This paper presents this new evaluation methodology based on IRT as well as the tool decodIRT, developed to guide IRT estimation over ML benchmarks.
Submitted 19 August, 2020; v1 submitted 29 July, 2020;
originally announced July 2020.
-
$β^3$-IRT: A New Item Response Model and its Applications
Authors:
Yu Chen,
Telmo Silva Filho,
Ricardo B. C. Prudêncio,
Tom Diethe,
Peter Flach
Abstract:
Item Response Theory (IRT) aims to assess latent abilities of respondents based on the correctness of their answers in aptitude test items with different difficulty levels. In this paper, we propose the $β^3$-IRT model, which models continuous responses and can generate a much enriched family of Item Characteristic Curves (ICCs). In experiments we applied the proposed model to data from an online exam platform, and show that our model outperforms a more standard 2PL-ND model on all datasets. Furthermore, we show how to apply $β^3$-IRT to assess the ability of machine learning classifiers. This novel application results in a new metric for evaluating the quality of the classifier's probability estimates, based on the inferred difficulty and discrimination of data instances.
Submitted 3 June, 2019; v1 submitted 10 March, 2019;
originally announced March 2019.
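To illustrate the kind of Item Characteristic Curve the $β^3$-IRT abstract above refers to, the sketch below evaluates an expected response as a function of ability, difficulty and discrimination. The abstract does not state the formula; the expression used here, 1 / (1 + (δ/(1-δ))^a · (θ/(1-θ))^(-a)) with ability θ and difficulty δ in (0, 1), is an assumption about the model's Beta-based parameterisation, and the function name is hypothetical.

```python
def expected_response(theta: float, delta: float, a: float) -> float:
    """Expected response under the assumed ICC form.

    theta: respondent ability in (0, 1)
    delta: item difficulty in (0, 1)
    a:     item discrimination (any real; its sign sets the curve direction)
    """
    if not (0.0 < theta < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("theta and delta must lie strictly in (0, 1)")
    difficulty_odds = delta / (1.0 - delta)
    ability_odds = theta / (1.0 - theta)
    return 1.0 / (1.0 + (difficulty_odds ** a) * (ability_odds ** (-a)))


# With delta = 0.5 and a = 1 the curve reduces to the identity in theta;
# a negative discrimination mirrors it.
for a in (1.0, 2.0, -1.0):
    print(a, expected_response(0.8, 0.5, a))
```

Under this assumed form, an item whose discrimination has the wrong sign yields a decreasing curve in ability, which is the kind of symmetry a fitting procedure would need to escape.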