-
Association measures for interval variables
Authors:
M. Rosário Oliveira,
Margarida Azeitona,
António Pacheco,
Rui Valadas
Abstract:
Symbolic Data Analysis (SDA) is a relatively new field of statistics that extends conventional data analysis by taking into account intrinsic data variability and structure. Unlike conventional data analysis, in SDA the features characterizing the data can be multi-valued, such as intervals or histograms. SDA has been mainly approached from a sampling perspective. In this work, we propose a model that links the micro-data and macro-data of interval-valued symbolic variables, which takes a populational perspective. Using this model, we derive the micro-data assumptions underlying the various definitions of symbolic covariance matrices proposed in the literature, and show that these assumptions can be too restrictive, raising applicability concerns. We analyze the various definitions using worked examples and four datasets. Our results show that the existence/absence of correlations in the macro-data may not be correctly captured by the definitions of symbolic covariance matrices and that, in real data, there can be a strong divergence between these definitions. Thus, in order to select the most appropriate definition, one must have some knowledge about the micro-data structure.
Submitted 26 January, 2021; v1 submitted 15 October, 2018;
originally announced October 2018.
-
Theoretical Foundations of Forward Feature Selection Methods based on Mutual Information
Authors:
Francisco Macedo,
M. Rosário Oliveira,
António Pacheco,
Rui Valadas
Abstract:
Feature selection problems arise in a variety of applications, such as microarray analysis, clinical prediction, text categorization, image classification and face recognition, multi-label learning, and classification of internet traffic. Among the various classes of methods, forward feature selection methods based on mutual information have become very popular and are widely used in practice. However, comparative evaluations of these methods have been limited by being based on specific datasets and classifiers. In this paper, we develop a theoretical framework that allows evaluating the methods based on their theoretical properties. Our framework is grounded on the properties of the target objective function that the methods try to approximate, and on a novel categorization of features, according to their contribution to the explanation of the class; we derive upper and lower bounds for the target objective function and relate these bounds with the feature types. Then, we characterize the types of approximations taken by the methods, and analyze how these approximations cope with the good properties of the target objective function. Additionally, we develop a distributional setting designed to illustrate the various deficiencies of the methods, and provide several examples of wrong feature selections. Based on our work, we identify clearly the methods that should be avoided, and the methods that currently have the best performance.
Submitted 14 February, 2018; v1 submitted 26 January, 2017;
originally announced January 2017.
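
As a rough illustration of what a forward feature selection method based on mutual information does, the sketch below greedily adds the feature with the best score, where the score is an mRMR-style trade-off (relevance to the class minus mean redundancy with the features already chosen). This is only one of the many approximations of the kind the abstract refers to, not necessarily one the paper recommends. The function names, the plug-in estimator for discrete data, and the toy feature names are assumptions made for this example.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y), in bits, for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts.
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def forward_select(features, target, k):
    """Greedy forward selection with an mRMR-style score (illustrative).

    features: dict mapping a hypothetical feature name to a list of values.
    Score of a candidate = I(candidate; target) minus the mean of
    I(candidate; s) over the already selected features s.
    """
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not selected:
                return relevance
            redundancy = sum(
                mutual_information(features[name], features[s])
                for s in selected
            ) / len(selected)
            return relevance - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a_copy" duplicates "a", so it is fully redundant once "a" is in.
y = [0, 0, 0, 0, 1, 1, 1, 1]
toy = {
    "a": [0, 0, 0, 0, 1, 1, 1, 0],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 0],
    "c": [0, 0, 1, 0, 1, 1, 0, 1],
}
chosen = forward_select(toy, y, 2)
```

On this toy data the duplicate feature is penalised by its redundancy with the first pick, so the less relevant but complementary feature is selected second.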
-
Theoretical Evaluation of Feature Selection Methods based on Mutual Information
Authors:
Cláudia Pascoal,
M. Rosário Oliveira,
António Pacheco,
Rui Valadas
Abstract:
Feature selection methods are usually evaluated by wrapping specific classifiers and datasets in the evaluation process, which very often results in unfair comparisons between methods. In this work, we develop a theoretical framework that allows obtaining the true feature ordering of two-dimensional sequential forward feature selection methods based on mutual information. This ordering is independent of entropy or mutual information estimation methods, classifiers, and datasets, and leads to an unambiguous comparison of the methods. Moreover, the theoretical framework unveils problems intrinsic to some methods that are otherwise difficult to detect, namely inconsistencies in the construction of the objective function used to select the candidate features, due to various types of indeterminations and to the possibility of the entropy of continuous random variables taking null and negative values.
Submitted 7 October, 2016; v1 submitted 21 September, 2016;
originally announced September 2016.
-
The cost of not having a perfect reference in diagnostic accuracy studies: theoretical results and a web visualisation tool
Authors:
Ana Subtil,
Maria Rosário Oliveira,
António Pacheco
Abstract:
Dichotomous diagnostic tests are widely used to detect the presence or absence of a biomedical condition of interest. A rigorous evaluation of the accuracy of a diagnostic test is critical to determine its practical value. Performance measures, such as the sensitivity and specificity of the test, should be estimated by comparison with a gold standard. Since an error-free reference test is frequently missing, approaches based on available imperfect diagnostic tests are used, namely: comparisons with an imperfect gold standard or with a composite reference standard, discrepant analysis, and latent class models.
In this work, we compare these methods using a theoretical approach based on analytical expressions for the deviations between the sensitivity and specificity according to each method and the corresponding true values. We explore how the deviations respond to varying conditions: the tests' sensitivities and specificities, the prevalence of the condition, and local dependence between the tests. An interactive R graphical application is made available for visualising the outcomes. Based on our findings, we discuss the methods' validity and potential usefulness.
Submitted 21 September, 2016; v1 submitted 23 August, 2016;
originally announced August 2016.
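
To make the idea of a deviation from the true values concrete, the sketch below covers only the simplest of the settings listed in the abstract: an index test compared with an imperfect gold standard, assuming the two tests are conditionally independent given the true condition (no local dependence). The function name and parameter names are assumptions made for this example, and the paper's own expressions, which also cover dependence and the other methods, are not reproduced here.

```python
def apparent_accuracy(se_t, sp_t, se_r, sp_r, prev):
    """Sensitivity and specificity of an index test as they would appear
    when measured against an imperfect reference test.

    Assumes conditional independence of the two tests given true status.
    se_t, sp_t: true sensitivity/specificity of the index test
    se_r, sp_r: true sensitivity/specificity of the reference test
    prev:       prevalence of the condition
    """
    # Probability that the reference test is positive / negative.
    p_ref_pos = se_r * prev + (1 - sp_r) * (1 - prev)
    p_ref_neg = (1 - se_r) * prev + sp_r * (1 - prev)

    # P(index + | reference +) and P(index - | reference -).
    app_se = (se_t * se_r * prev + (1 - sp_t) * (1 - sp_r) * (1 - prev)) / p_ref_pos
    app_sp = (sp_t * sp_r * (1 - prev) + (1 - se_t) * (1 - se_r) * prev) / p_ref_neg
    return app_se, app_sp


# Deviations of the apparent values from the true ones for one scenario.
app_se, app_sp = apparent_accuracy(se_t=0.9, sp_t=0.95, se_r=0.8, sp_r=0.9, prev=0.2)
dev_se, dev_sp = app_se - 0.9, app_sp - 0.95
```

With a perfect reference (`se_r = sp_r = 1`) the apparent values reduce to the true ones; in the scenario above both deviations are negative, meaning the index test looks worse than it really is.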
-
The zero-inflated promotion cure rate regression model applied to fraud propensity in bank loan applications
Authors:
Francisco Louzada,
Mauro R. de Oliveira Jr,
Fernando F. Moreira
Abstract:
In this paper we extend the promotion cure rate model proposed by Chen et al. (1999) by incorporating an excess of zeros in the modelling. Although the current approach, which is based on a biological interpretation of the causes that trigger the event of interest, allows the covariates to be related to the cured fraction, it does not allow them to be related to the fraction of zeros. The presence of zeros in survival data, unusual in medical studies, can occur frequently in banking loan portfolios, as presented in Louzada et al. (2015), who deal with the propensity to fraud in loans granted by a major Brazilian bank. To illustrate the new cure rate survival method, the same real dataset analyzed in Louzada et al. (2015) is fitted here, and the results are compared.
Submitted 1 October, 2015;
originally announced October 2015.
-
The zero-inflated cure rate regression model: Applications to fraud detection in bank loan portfolios
Authors:
Francisco Louzada,
Mauro R. de Oliveira Jr,
Fernando F. Moreira
Abstract:
In this paper, we introduce a methodology based on the zero-inflated cure rate model to detect fraudsters in bank loan applications. Our approach accommodates three different types of loan applicants: fraudsters, those who are susceptible to default, and those who are not susceptible to default. An advantage of our approach is that it accommodates zero-inflated times, which is not possible in the standard cure rate model. To illustrate the proposed method, a real dataset of loan survival times is fitted by the zero-inflated Weibull cure rate model. The parameters are estimated by maximum likelihood, and Monte Carlo simulations are carried out to check the finite-sample performance of the estimators.
Submitted 19 September, 2015; v1 submitted 17 September, 2015;
originally announced September 2015.
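
The three types of applicants described in the abstract can be pictured as a three-part mixture. The sketch below assumes one natural way of writing it: a point mass at time zero (fraudsters), a cured fraction that never defaults, and a Weibull-distributed time for the remaining, susceptible applicants. The exact parameterisation used in the paper, including how covariates enter, is not given in this listing, so the function name, the parameter names, and the mixture form itself are assumptions.

```python
import math


def zi_cure_survival(t, p_zero, p_cured, shape, scale):
    """P(T > t) for t >= 0 under an assumed zero-inflated Weibull cure model.

    p_zero:  probability of a time equal to zero (fraudsters)
    p_cured: probability of never experiencing the event (not susceptible)
    shape, scale: Weibull parameters for the susceptible applicants
    """
    if t < 0:
        raise ValueError("t must be non-negative")
    if p_zero < 0 or p_cured < 0 or p_zero + p_cured > 1:
        raise ValueError("p_zero and p_cured must be valid mixture weights")
    p_susceptible = 1.0 - p_zero - p_cured
    weibull_surv = math.exp(-((t / scale) ** shape))
    # Cured applicants always survive; susceptible ones follow the Weibull law;
    # zero-time applicants contribute nothing to P(T > t) for any t >= 0.
    return p_cured + p_susceptible * weibull_surv


# The curve starts at 1 - p_zero and levels off at p_cured.
start = zi_cure_survival(0.0, p_zero=0.05, p_cured=0.6, shape=1.2, scale=24.0)
plateau = zi_cure_survival(1e6, p_zero=0.05, p_cured=0.6, shape=1.2, scale=24.0)
```

Under this form the survival curve is improper at both ends: it starts below one because of the zeros and does not decay to zero because of the cured fraction.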
-
Recovery Risk: Application of the Latent Competing Risks Model to Non performing Loans
Authors:
Mauro R. Oliveira,
Francisco Louzada
Abstract:
This article proposes a method for measuring the latent risks involved in the recovery process of non-performing loans in financial institutions and business firms that deal with collection and recovery processes. To that end, we apply the competing risks model referred to in the literature as the promotion time model. The result achieved is the probability of credit recovery for a portfolio segmented into groups based on the information available. Within the context of competing risks, application of the technique yielded an estimate of the number of latent events that contribute to the credit recovery event. With these results in hand, we were able to compare groups of defaulters in terms of risk or susceptibility to the recovery event during the collection process, and thereby determine where collection actions are most efficient. We specify the Poisson distribution for the number of latent causes leading to recovery, and the Weibull distribution for the time up to recovery. To estimate the model parameters, we use the maximum likelihood method. Finally, the model was applied to a sample of defaulted loans from a financial institution.
Submitted 19 August, 2014;
originally announced August 2014.
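
For readers unfamiliar with the promotion time model, the sketch below shows its standard population survival function when the number of latent causes is Poisson with mean theta and each latent time is Weibull: the observed time is the minimum of the latent times, which gives S(t) = exp(-theta * F(t)), with F the Weibull distribution function. The function and parameter names are assumptions made for this example, and the grouping of the portfolio and the likelihood fitting described in the abstract are not shown.

```python
import math


def weibull_cdf(t, shape, scale):
    """Distribution function of a single latent time."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def promotion_time_survival(t, theta, shape, scale):
    """P(no recovery by time t) with Poisson(theta) latent causes.

    The event occurs at the earliest of the latent times, so survival past t
    requires every latent time to exceed t; averaging over the Poisson count
    gives exp(-theta * F(t)).
    """
    return math.exp(-theta * weibull_cdf(t, shape, scale))


def never_event_fraction(theta):
    """Long-run plateau of the survival curve, P(zero latent causes)."""
    return math.exp(-theta)


# A group with a larger theta has more latent causes on average and hence a
# higher probability of recovery by any given time.
p_recover_low = 1.0 - promotion_time_survival(12.0, theta=0.5, shape=1.5, scale=10.0)
p_recover_high = 1.0 - promotion_time_survival(12.0, theta=2.0, shape=1.5, scale=10.0)
```

Comparing estimated values of theta across groups is therefore one way to read the comparison of susceptibility to recovery that the abstract mentions.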
-
An Evidence of Link between Default and Loss of Bank Loans from the Modeling of Competing Risks
Authors:
Mauro R. Oliveira,
Francisco Louzada
Abstract:
In this paper, we propose a method for comparing the risks that lead a customer to default with those of the debt collection process that may lead the defaulter to recover. By estimating the competing risks that lead to the event of interest, we show that there is a significant relation between the intensity of default and the losses from defaulted loans in collection processes. To this end, we investigate a competing risks model applied to the whole credit risk cycle of a bank loan portfolio. We estimate the competing causes related to the occurrence of default and then compare them with the estimated competing causes that lead loans to the write-off condition. For the competing risks model, we specify a Poisson distribution for the number of competing causes and a Weibull distribution for the failure times. Maximum likelihood is used for parameter estimation, and the model is applied to a real dataset of personal loans.
Submitted 19 August, 2014;
originally announced August 2014.