-
A Typology of Data Anomalies
Authors:
Ralph Foorthuis
Abstract:
Anomalies are cases that are in some way unusual and do not appear to fit the general patterns present in the dataset. Several conceptualizations exist to distinguish between different types of anomalies. However, these are either too specific to be generally applicable or so abstract that they neither provide concrete insight into the nature of anomaly types nor facilitate the functional evaluati…
▽ More
Anomalies are cases that are in some way unusual and do not appear to fit the general patterns present in the dataset. Several conceptualizations exist to distinguish between different types of anomalies. However, these are either too specific to be generally applicable or so abstract that they neither provide concrete insight into the nature of anomaly types nor facilitate the functional evaluation of anomaly detection algorithms. With the recent criticism on 'black box' algorithms and analytics it has become clear that this is an undesirable situation. This paper therefore introduces a general typology of anomalies that offers a clear and tangible definition of the different types of anomalies in datasets. The typology also facilitates the evaluation of the functional capabilities of anomaly detection algorithms and as a framework assists in analyzing the conceptual levels of data, patterns and anomalies. Finally, it serves as an analytical tool for studying anomaly types from other typologies.
△ Less
Submitted 4 July, 2021;
originally announced July 2021.
-
Algorithmic Frameworks for the Detection of High Density Anomalies
Authors:
Ralph Foorthuis
Abstract:
This study explores the concept of high-density anomalies. As opposed to the traditional concept of anomalies as isolated occurrences, high-density anomalies are deviant cases positioned in the most normal regions of the data space. Such anomalies are relevant for various practical use cases, such as misbehavior detection and data quality analysis. Effective methods for identifying them are partic…
▽ More
This study explores the concept of high-density anomalies. As opposed to the traditional concept of anomalies as isolated occurrences, high-density anomalies are deviant cases positioned in the most normal regions of the data space. Such anomalies are relevant for various practical use cases, such as misbehavior detection and data quality analysis. Effective methods for identifying them are particularly important when analyzing very large or noisy sets, for which traditional anomaly detection algorithms will return many false positives. In order to be able to identify high-density anomalies, this study introduces several non-parametric algorithmic frameworks for unsupervised detection. These frameworks are able to leverage existing underlying anomaly detection algorithms and offer different solutions for the balancing problem inherent in this detection task. The frameworks are evaluated with both synthetic and real-world datasets, and are compared with existing baseline algorithms for detecting traditional anomalies. The Iterative Partial Push (IPP) framework proves to yield the best detection results.
△ Less
Submitted 4 April, 2021; v1 submitted 9 October, 2020;
originally announced October 2020.
-
The Impact of Discretization Method on the Detection of Six Types of Anomalies in Datasets
Authors:
Ralph Foorthuis
Abstract:
Anomaly detection is the process of identifying cases, or groups of cases, that are in some way unusual and do not fit the general patterns present in the dataset. Numerous algorithms use discretization of numerical data in their detection processes. This study investigates the effect of the discretization method on the unsupervised detection of each of the six anomaly types acknowledged in a rece…
▽ More
Anomaly detection is the process of identifying cases, or groups of cases, that are in some way unusual and do not fit the general patterns present in the dataset. Numerous algorithms use discretization of numerical data in their detection processes. This study investigates the effect of the discretization method on the unsupervised detection of each of the six anomaly types acknowledged in a recent typology of data anomalies. To this end, experiments are conducted with various datasets and SECODA, a general-purpose algorithm for unsupervised non-parametric anomaly detection in datasets with numerical and categorical attributes. This algorithm employs discretization of continuous attributes, exponentially increasing weights and discretization cut points, and a pruning heuristic to detect anomalies with an optimal number of iterations. The results demonstrate that standard SECODA can detect all six types, but that different discretization methods favor the discovery of certain anomaly types. The main findings also hold for other detection techniques using discretization.
△ Less
Submitted 27 August, 2020;
originally announced August 2020.
-
On Course, But Not There Yet: Enterprise Architecture Conformance and Benefits in Systems Development
Authors:
Ralph Foorthuis,
Marlies van Steenbergen,
Nino Mushkudiani,
Wiel Bruls,
Sjaak Brinkkemper,
Rik Bos
Abstract:
Various claims have been made regarding the benefits that Enterprise Architecture (EA) delivers for both individual systems development projects and the organization as a whole. This paper presents the statistical findings of a survey study (n=293) carried out to empirically test these claims. First, we investigated which techniques are used in practice to stimulate conformance to EA. Secondly, we…
▽ More
Various claims have been made regarding the benefits that Enterprise Architecture (EA) delivers for both individual systems development projects and the organization as a whole. This paper presents the statistical findings of a survey study (n=293) carried out to empirically test these claims. First, we investigated which techniques are used in practice to stimulate conformance to EA. Secondly, we studied which benefits are actually gained. Thirdly, we verified whether EA creators (e.g. enterprise architects) and EA users (e.g. project members) differ in their perceptions regarding EA. Finally, we investigated which of the applied techniques most effectively increase project conformance to and effectiveness of EA. A multivariate regression analysis demonstrates that three techniques have a major impact on conformance: carrying out compliance assessments, management propagation of EA and providing assistance to projects. Although project conformance plays a central role in reaping various benefits at both the organizational and the project level, it is shown that a number of important benefits have not yet been fully achieved.
△ Less
Submitted 23 August, 2020;
originally announced August 2020.
-
A Theory Building Study of Enterprise Architecture Practices and Benefits
Authors:
Ralph Foorthuis,
Marlies van Steenbergen,
Sjaak Brinkkemper,
Wiel Bruls
Abstract:
Academics and practitioners have made various claims regarding the benefits that Enterprise Architecture (EA) delivers for both individual projects and the organization as a whole. At the same time, there is a lack of explanatory theory regarding how EA delivers these benefits. Moreover, EA practices and benefits have not been extensively investigated by empirical research, with especially quantit…
▽ More
Academics and practitioners have made various claims regarding the benefits that Enterprise Architecture (EA) delivers for both individual projects and the organization as a whole. At the same time, there is a lack of explanatory theory regarding how EA delivers these benefits. Moreover, EA practices and benefits have not been extensively investigated by empirical research, with especially quantitative studies on the topic being few and far between. This paper therefore presents the statistical findings of a theory-building survey study (n=293). The resulting PLS model is a synthesis of current implicit and fragmented theory, and shows how EA practices and intermediate benefits jointly work to help the organization reap benefits for both the organization and its projects. The model shows that EA and EA practices do not deliver benefits directly, but operate through intermediate results, most notably compliance with EA and architectural insight. Furthermore, the research identifies the EA practices that have a major impact on these results, the most important being compliance assessments, management propagation of EA, and different types of knowledge exchange. The results also demonstrate that projects play an important role in obtaining benefits from EA, but that they generally benefit less than the organization as a whole.
△ Less
Submitted 18 August, 2020;
originally announced August 2020.
-
SECODA: Segmentation- and Combination-Based Detection of Anomalies
Authors:
Ralph Foorthuis
Abstract:
This study introduces SECODA, a novel general-purpose unsupervised non-parametric anomaly detection algorithm for datasets containing continuous and categorical attributes. The method is guaranteed to identify cases with unique or sparse combinations of attribute values. Continuous attributes are discretized repeatedly in order to correctly determine the frequency of such value combinations. The c…
▽ More
This study introduces SECODA, a novel general-purpose unsupervised non-parametric anomaly detection algorithm for datasets containing continuous and categorical attributes. The method is guaranteed to identify cases with unique or sparse combinations of attribute values. Continuous attributes are discretized repeatedly in order to correctly determine the frequency of such value combinations. The concept of constellations, exponentially increasing weights and discretization cut points, as well as a pruning heuristic are used to detect anomalies with an optimal number of iterations. Moreover, the algorithm has a low memory imprint and its runtime performance scales linearly with the size of the dataset. An evaluation with simulated and real-life datasets shows that this algorithm is able to identify many different types of anomalies, including complex multidimensional instances. An evaluation in terms of a data quality use case with a real dataset demonstrates that SECODA can bring relevant and practical value to real-world settings.
△ Less
Submitted 16 August, 2020;
originally announced August 2020.
-
Tactics for Internal Compliance: A Literature Review
Authors:
Ralph Foorthuis
Abstract:
Compliance of organizations with internal and external norms is a highly relevant topic for both practitioners and academics nowadays. However, the substantive, elementary compliance tactics that organizations can use for achieving internal compliance have been described in a fragmented manner and in the literatures of distinct academic disciplines. Using a multidisciplinary structured literature…
▽ More
Compliance of organizations with internal and external norms is a highly relevant topic for both practitioners and academics nowadays. However, the substantive, elementary compliance tactics that organizations can use for achieving internal compliance have been described in a fragmented manner and in the literatures of distinct academic disciplines. Using a multidisciplinary structured literature review of 134 publications, this study offers three contributions. First, we present a typology of 45 compliance tactics, which constitutes a comprehensive and rich overview of elementary ways for bringing the organization into compliance. Secondly, we provide an overview of fundamental concepts in the theory of compliance, which forms the basis for the framework we developed for positioning compliance tactics and for analyzing or developing compliance strategies. Thirdly, we present insights for moving from compliance tactics to compliance strategies. In the process, and using the multidisciplinary literature review to take a bird's-eye view, we demonstrate that compliance strategies need to be regarded as a richer concept than perceived hitherto. We also show that opportunities for innovation exist.
△ Less
Submitted 9 August, 2020;
originally announced August 2020.
-
On the Nature and Types of Anomalies: A Review of Deviations in Data
Authors:
Ralph Foorthuis
Abstract:
Anomalies are occurrences in a dataset that are in some way unusual and do not fit the general patterns. The concept of the anomaly is typically ill-defined and perceived as vague and domain-dependent. Moreover, despite some 250 years of publications on the topic, no comprehensive and concrete overviews of the different types of anomalies have hitherto been published. By means of an extensive lite…
▽ More
Anomalies are occurrences in a dataset that are in some way unusual and do not fit the general patterns. The concept of the anomaly is typically ill-defined and perceived as vague and domain-dependent. Moreover, despite some 250 years of publications on the topic, no comprehensive and concrete overviews of the different types of anomalies have hitherto been published. By means of an extensive literature review this study therefore offers the first theoretically principled and domain-independent typology of data anomalies and presents a full overview of anomaly types and subtypes. To concretely define the concept of the anomaly and its different manifestations, the typology employs five dimensions: data type, cardinality of relationship, anomaly level, data structure, and data distribution. These fundamental and data-centric dimensions naturally yield 3 broad groups, 9 basic types, and 63 subtypes of anomalies. The typology facilitates the evaluation of the functional capabilities of anomaly detection algorithms, contributes to explainable data science, and provides insights into relevant topics such as local versus global anomalies.
△ Less
Submitted 29 May, 2023; v1 submitted 30 July, 2020;
originally announced July 2020.