Showing 1–42 of 42 results for author: Rousseeuw, P

Searching in archive stat.
  1. Kernel Outlier Detection

    Authors: Can Hakan Dağıdır, Mia Hubert, Peter J. Rousseeuw

    Abstract: A new anomaly detection method called kernel outlier detection (KOD) is proposed. It is designed to address challenges of outlier detection in high-dimensional settings. The aim is to overcome limitations of existing methods, such as dependence on distributional assumptions or on hyperparameters that are hard to tune. KOD starts with a kernel transformation, followed by a projection pursuit approa…

    Submitted 28 June, 2025; originally announced June 2025.

    Journal ref: Journal of Data Science, Statistics, and Visualisation (2025), Volume 5, Issue 8

  2. arXiv:2505.19925

    stat.ME cs.LG

    Cellwise and Casewise Robust Covariance in High Dimensions

    Authors: Fabio Centofanti, Mia Hubert, Peter J. Rousseeuw

    Abstract: The sample covariance matrix is a cornerstone of multivariate statistics, but it is highly sensitive to outliers. These can be casewise outliers, such as cases belonging to a different population, or cellwise outliers, which are deviating cells (entries) of the data matrix. Recently some robust covariance estimators have been developed that can handle both types of outliers, but their computation…

    Submitted 26 May, 2025; originally announced May 2025.

  3. arXiv:2505.09425

    stat.CO cs.LG

    Independent Component Analysis by Robust Distance Correlation

    Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw, Tom Van Deuren, Tim Verdonck

    Abstract: Independent component analysis (ICA) is a powerful tool for decomposing a multivariate signal or distribution into fully independent sources, not just uncorrelated ones. Unfortunately, most approaches to ICA are not robust against outliers. Here we propose a robust ICA method called RICA, which estimates the components by minimizing a robust measure of dependence between multivariate random variab…

    Submitted 14 May, 2025; originally announced May 2025.

  4. arXiv:2412.16980

    stat.ME stat.AP

    Explainable Linear and Generalized Linear Models by the Predictions Plot

    Authors: Peter J. Rousseeuw

    Abstract: Many statistics courses cover multiple linear regression, and present students with the formula of a prediction using the regressors, slopes, and an intercept. But is it really easy to see which terms have the largest effect, or to explain why the prediction of a specific case is unusually high or low? To assist with this the so-called predictions plot is proposed. Its simplicity makes it easy to…

    Submitted 23 June, 2025; v1 submitted 22 December, 2024; originally announced December 2024.

  5. arXiv:2411.01954

    stat.CO stat.ML

    RobPy: a Python Package for Robust Statistical Methods

    Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw, Thomas Servotte, Tim Verdonck

    Abstract: Robust estimation provides essential tools for analyzing data that contain outliers, ensuring that statistical models remain reliable even in the presence of some anomalous data. While robust methods have long been available in R, users of Python have lacked a comprehensive package that offers these methods in a cohesive framework. RobPy addresses this gap by offering a wide range of robust method…

    Submitted 4 November, 2024; originally announced November 2024.

  6. arXiv:2408.15701

    stat.ME stat.CO

    Robust discriminant analysis

    Authors: Mia Hubert, Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: Discriminant analysis (DA) is one of the most popular methods for classification due to its conceptual simplicity, low computational cost, and often solid performance. In its standard form, DA uses the arithmetic mean and sample covariance matrix to estimate the center and scatter of each class. We discuss and illustrate how this makes standard DA very sensitive to suspicious data points, such as…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: Accepted for publication in WIREs Computational Statistics (Wiley Interdisciplinary Reviews)

  7. arXiv:2408.13596

    stat.ME stat.CO

    Robust Principal Components by Casewise and Cellwise Weighting

    Authors: Fabio Centofanti, Mia Hubert, Peter J. Rousseeuw

    Abstract: Principal component analysis (PCA) is a fundamental tool for analyzing multivariate data. Here the focus is on dimension reduction to the principal subspace, characterized by its projection matrix. The classical principal subspace can be strongly affected by the presence of outliers. Traditional robust approaches consider casewise outliers, that is, cases generated by an unspecified outlier distri…

    Submitted 24 August, 2024; originally announced August 2024.

  8. Distance Covariance, Independence, and Pairwise Differences

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: (To appear in The American Statistician.) Distance covariance (Székely, Rizzo, and Bakirov, 2007) is a fascinating recent notion, which is popular as a test for dependence of any type between random variables $X$ and $Y$. This approach deserves to be touched upon in modern courses on mathematical statistics. It makes use of distances of the type $|X-X'|$ and $|Y-Y'|$, where $(X',Y')$ is an indepen…

    Submitted 18 June, 2024; originally announced June 2024.

    Journal ref: The American Statistician, 2024

  9. arXiv:2403.03722

    stat.ME

    Robust Distance Covariance

    Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: Distance covariance is a popular measure of dependence between random variables. It has some robustness properties, but not all. We prove that the influence function of the usual distance covariance is bounded, but that its breakdown value is zero. Moreover, it has an unbounded sensitivity function, converging to the bounded influence function for increasing sample size. To address this sensitivit…

    Submitted 12 June, 2025; v1 submitted 6 March, 2024; originally announced March 2024.

    Comments: To appear in International Statistical Review as a discussion paper

  10. Multivariate Singular Spectrum Analysis by Robust Diagonalwise Low-Rank Approximation

    Authors: Fabio Centofanti, Mia Hubert, Biagio Palumbo, Peter J. Rousseeuw

    Abstract: Multivariate Singular Spectrum Analysis (MSSA) is a powerful and widely used nonparametric method for multivariate time series, which allows the analysis of complex temporal data from diverse fields such as finance, healthcare, ecology, and engineering. However, MSSA lacks robustness against outliers because it relies on the singular value decomposition, which is very sensitive to the presence of…

    Submitted 2 October, 2023; originally announced October 2023.

    Journal ref: Journal of Computational and Graphical Statistics, 2024

  11. arXiv:2302.03931

    stat.ML cs.LG stat.ME

    Fast Linear Model Trees by PILOT

    Authors: Jakob Raymaekers, Peter J. Rousseeuw, Tim Verdonck, Ruicong Yao

    Abstract: Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and therefore not scalable to large data sets. In addit…

    Submitted 8 February, 2023; originally announced February 2023.

    Journal ref: Machine Learning, 2024

  12. Challenges of cellwise outliers

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: It is well-known that real data often contain outliers. The term outlier typically refers to a case, that is, a row of the $n \times d$ data matrix. In recent times a different type has come into focus, the cellwise outliers. These are suspicious cells (entries) that can occur anywhere in the data matrix. Even a relatively small proportion of outlying cells can contaminate over half the rows, whic…

    Submitted 4 February, 2023; originally announced February 2023.

    Journal ref: Econometrics and Statistics, 2024

  13. Analyzing cellwise weighted data

    Authors: Peter J. Rousseeuw

    Abstract: Often the rows (cases, objects) of a dataset have weights. For instance, the weight of a case may reflect the number of times it has been observed, or its reliability. For analyzing such data many rowwise weighted techniques are available, the most well known being the weighted average. But there are also situations where the individual cells (entries) of the data matrix have weights assigned to t…

    Submitted 3 January, 2023; v1 submitted 26 September, 2022; originally announced September 2022.

    Journal ref: Econometrics and Statistics, 2024

  14. The Cellwise Minimum Covariance Determinant Estimator

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: The usual Minimum Covariance Determinant (MCD) estimator of a covariance matrix is robust against casewise outliers. These are cases (that is, rows of the data matrix) that behave differently from the majority of cases, raising suspicion that they might belong to a different population. On the other hand, cellwise outliers are individual cells in the data matrix. When a row contains one or more ou…

    Submitted 15 November, 2023; v1 submitted 27 July, 2022; originally announced July 2022.

    Journal ref: Journal of the American Statistical Association, 2025

  15. Silhouettes and quasi residual plots for neural nets and tree-based classifiers

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: Classification by neural nets and by tree-based methods are powerful tools of machine learning. There exist interesting visualizations of the inner workings of these and other classifiers. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or…

    Submitted 26 February, 2022; v1 submitted 16 June, 2021; originally announced June 2021.

    Journal ref: Journal of Computational and Graphical Statistics 2022, Volume 31, 1332-1343

  16. Real-time discriminant analysis in the presence of label and measurement noise

    Authors: Iwein Vranckx, Jakob Raymaekers, Bart De Ketelaere, Peter J. Rousseeuw, Mia Hubert

    Abstract: Quadratic discriminant analysis (QDA) is a widely used classification technique. Based on a training dataset, each class in the data is characterized by an estimate of its center and shape, which can then be used to assign unseen observations to one of the classes. The traditional QDA rule relies on the empirical mean and covariance matrix. Unfortunately, these estimators are sensitive to label an…

    Submitted 10 November, 2020; v1 submitted 29 August, 2020; originally announced August 2020.

    Journal ref: Chemometrics and Intelligent Laboratory Systems, 2021, Volume 208

  17. arXiv:2008.05171

    cs.LG cs.AI stat.ML

    Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms

    Authors: Erich Schubert, Peter J. Rousseeuw

    Abstract: Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids clustering. In Euclidean geometry the mean, as used in k-means, is a good estimator for the cluster center, but this does not exist for arbitrary dissimilarities. PAM uses the medoid instead, t…

    Submitted 1 June, 2021; v1 submitted 12 August, 2020; originally announced August 2020.

    Journal ref: Information Systems 2021, 101804

  18. arXiv:2008.02046

    stat.ML cs.LG stat.CO

    Outlier detection in non-elliptical data by kernel MRCD

    Authors: Joachim Schreurs, Iwein Vranckx, Mia Hubert, Johan A. K. Suykens, Peter J. Rousseeuw

    Abstract: The minimum regularized covariance determinant method (MRCD) is a robust estimator for multivariate location and scatter, which detects outliers by fitting a robust covariance matrix to the data. Its regularization ensures that the covariance matrix is well-conditioned in any dimension. The MRCD assumes that the non-outlying observations are roughly elliptically distributed, but many datasets are…

    Submitted 29 March, 2021; v1 submitted 5 August, 2020; originally announced August 2020.

    Journal ref: Statistics and Computing, 2021, Volume 31, article 66

  19. arXiv:2007.14495

    stat.ML cs.LG stat.CO stat.ME

    Class maps for visualizing classification results

    Authors: Jakob Raymaekers, Peter J. Rousseeuw, Mia Hubert

    Abstract: Classification is a major tool of statistics and machine learning. A classification method first processes a training set of objects with given classes (labels), with the goal of afterward assigning new objects to one of these classes. When running the resulting prediction method on the training data or on test data, it can happen that an object is predicted to lie in a class that differs from its…

    Submitted 19 May, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

    Comments: Appeared online, Technometrics

    Journal ref: Technometrics 2022, Vol. 64, pages 151-165

  20. Transforming variables to central normality

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box-Cox and Yeo-Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformati…

    Submitted 21 November, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

    Journal ref: Machine Learning, 2021

  21. Handling cellwise outliers by sparse regression and robust covariance

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: We propose a data-analytic method for detecting cellwise outliers. Given a robust covariance matrix, outlying cells (entries) in a row are found by the cellHandler technique which combines lasso regression with a stepwise application of constructed cutoff values. The penalty term of the lasso has a physical interpretation as the total distance that suspicious cells need to move in order to bring t…

    Submitted 7 December, 2020; v1 submitted 28 December, 2019; originally announced December 2019.

    Journal ref: Journal of Data Science, Statistics, and Visualisation, 2021, issue 3

  22. Real-time outlier detection for large datasets by RT-DetMCD

    Authors: Bart De Ketelaere, Mia Hubert, Jakob Raymaekers, Peter J. Rousseeuw, Iwein Vranckx

    Abstract: Modern industrial machines can generate gigabytes of data in seconds, frequently pushing the boundaries of available computing power. Together with the time criticality of industrial processing this presents a challenging problem for any data analytics procedure. We focus on the deterministic minimum covariance determinant method (DetMCD), which detects outliers by fitting a robust covariance matr…

    Submitted 24 January, 2020; v1 submitted 12 October, 2019; originally announced October 2019.

    Journal ref: Chemometrics and Intelligent Laboratory Systems, 2020, Volume 199

  23. Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

    Authors: Erich Schubert, Peter J. Rousseeuw

    Abstract: Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In Euclidean geometry the mean, as used in k-means, is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead, the object wi…

    Submitted 29 October, 2019; v1 submitted 12 October, 2018; originally announced October 2018.

    Journal ref: Similarity Search and Applications, SISAP 2019

  24. Clustering genomic words in human DNA using peaks and trends of distributions

    Authors: Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Paula Brito, Vera Afreixo

    Abstract: In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the 'trend'), and a sparse vector of…

    Submitted 13 August, 2018; originally announced August 2018.

    Journal ref: Advances in Data Analysis and Classification, 2020

  25. Robust Identification of Target Genes and Outliers in Triple-negative Breast Cancer Data

    Authors: Pieter Segaert, Marta B. Lopes, Sandra Casimiro, Susana Vinga, Peter J. Rousseeuw

    Abstract: Correct classification of breast cancer sub-types is of high importance as it directly affects the therapeutic options. We focus on triple-negative breast cancer (TNBC) which has the worst prognosis among breast cancer types. Using cutting edge methods from the field of robust statistics, we analyze Breast Invasive Carcinoma (BRCA) transcriptomic data publicly available from The Cancer Genome Atla…

    Submitted 4 July, 2018; originally announced July 2018.

    Journal ref: Statistical Methods in Medical Research, 2019, Vol. 28, 3042-3056

  26. MacroPCA: An all-in-one PCA method allowing for missing values as well as cellwise and rowwise outliers

    Authors: Mia Hubert, Peter J. Rousseeuw, Wannes Van den Bossche

    Abstract: Multivariate data are typically represented by a rectangular matrix (table) in which the rows are the objects (cases) and the columns are the variables (measurements). When there are many variables one often reduces the dimension by principal component analysis (PCA), which in its basic form is not robust to outliers. Much research has focused on handling rowwise outliers, i.e. rows that deviate f…

    Submitted 9 December, 2018; v1 submitted 4 June, 2018; originally announced June 2018.

    Journal ref: Technometrics, 2019, Vol. 61, 459-473

  27. A generalized spatial sign covariance matrix

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: The well-known spatial sign covariance matrix (SSCM) carries out a radial transform which moves all data points to a sphere, followed by computing the classical covariance matrix of the transformed data. Its popularity stems from its robustness to outliers, fast computation, and applications to correlation and principal component analysis. In this paper we study more general radial functions. It i…

    Submitted 10 October, 2018; v1 submitted 3 May, 2018; originally announced May 2018.

    Journal ref: Journal of Multivariate Analysis, 2019, Vol. 171, 94-111

  28. Discussion of "The power of monitoring"

    Authors: Jakob Raymaekers, Peter J. Rousseeuw, Iwein Vranckx

    Abstract: This is an invited comment on the discussion paper "The power of monitoring: how to make the most of a contaminated multivariate sample" by A. Cerioli, M. Riani, A. Atkinson and A. Corbellini that will appear in the journal Statistical Methods & Applications.

    Submitted 13 March, 2018; originally announced March 2018.

    Journal ref: Statistical Methods and Applications, 2018, Vol. 27, 589-594

  29. Fast robust correlation for high-dimensional data

    Authors: Jakob Raymaekers, Peter J. Rousseeuw

    Abstract: The product moment covariance is a cornerstone of multivariate data analysis, from which one can derive correlations, principal components, Mahalanobis distances and many other results. Unfortunately the product moment covariance and the corresponding Pearson correlation are very susceptible to outliers (anomalies) in the data. Several robust measures of covariance have been developed, but few are…

    Submitted 20 October, 2019; v1 submitted 14 December, 2017; originally announced December 2017.

    Journal ref: Technometrics, vol. 63, 184-198 (2021)

  30. Comparing reverse complementary genomic words based on their distance distributions and frequencies

    Authors: Ana Helena Tavares, Jakob Raymaekers, Peter Rousseeuw, Raquel M. Silva, Carlos A. C. Bastos, Armando Pinho, Paula Brito, Vera Afreixo

    Abstract: In this work we study reverse complementary genomic word pairs in the human DNA, by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance distributions are considered, and it is found that the peak dissimilarity works best in this setting. We report the existence of reverse complementary word pair…

    Submitted 6 October, 2017; originally announced October 2017.

    Comments: Post-print of a paper accepted for publication in "Interdisciplinary Sciences: Computational Life Sciences" (ISSN: 1913-2751, ESSN: 1867-1462)

    MSC Class: 62P10

    Journal ref: Interdisciplinary Sciences: Computational Life Sciences, 2018, Vol. 10, 1-11

  31. Minimum Covariance Determinant and Extensions

    Authors: Mia Hubert, Michiel Debruyne, Peter J. Rousseeuw

    Abstract: The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. Since estimating the covariance matrix is the cornerstone of many multivariate statistical methods, the MCD is an important building block when developing robust multivariate techniques. It also serves as a convenient and efficient tool for out…

    Submitted 20 September, 2017; originally announced September 2017.

    Journal ref: WIREs Computational Statistics, 2017, wics.1421

  32. Econometric applications of high-breakdown robust regression techniques

    Authors: Asad Zaman, Peter J. Rousseeuw, Mehmet Orhan

    Abstract: A literature search shows that robust regression techniques are rarely used in applied econometrics. We list several misconceptions about robustness which lead to this situation. We show that most data sets are not normal, least squares performs very poorly even in large data sets with small numbers of outliers, and that commonly used techniques for achieving robustness fail to do so. We then prov…

    Submitted 1 September, 2017; originally announced September 2017.

    Comments: 8 pages, adds paragraph omitted from final print version in journal

    MSC Class: 62J05 ACM Class: G.3

    Journal ref: Economics Letters Volume 71, Issue 1, April 2001, Pages 1-8

  33. Robust Monitoring of Time Series with Application to Fraud Detection

    Authors: Peter J. Rousseeuw, Domenico Perrotta, Marco Riani, Mia Hubert

    Abstract: Time series often contain outliers and level shifts or structural changes. These unexpected events are of the utmost importance in fraud detection, as they may pinpoint suspicious transactions. The presence of such unusual events can easily mislead conventional time series analysis and yield erroneous conclusions. In this paper we provide a unified framework for detecting outliers and level shifts…

    Submitted 20 May, 2018; v1 submitted 28 August, 2017; originally announced August 2017.

    Journal ref: Econometrics and Statistics, 2019, Vol. 9, 108-121

  34. Anomaly Detection by Robust Statistics

    Authors: Peter J. Rousseeuw, Mia Hubert

    Abstract: Real data often contain anomalous cases, also known as outliers. These may spoil the resulting analysis but they may also contain valuable information. In either case, the ability to detect such anomalies is essential. A useful tool for this purpose is robust statistics, which aims to detect the outliers by first fitting the majority of the data and then flagging data points that deviate from it.…

    Submitted 14 October, 2017; v1 submitted 31 July, 2017; originally announced July 2017.

    Comments: To appear in WIREs Data Mining and Knowledge Discovery

    Journal ref: WIREs Data Mining and Knowledge Discovery, 2018, widm.1236

  35. Dissimilar Symmetric Word Pairs in the Human Genome

    Authors: Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Raquel M. Silva, Carlos A. C. Bastos, Armando Pinho, Paula Brito, Vera Afreixo

    Abstract: In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions. Since symmetric pairs with different patterns could point to evolutionary features, we search for the pairs with the most dissimilar behaviour. We focus our study…

    Submitted 5 July, 2017; v1 submitted 14 February, 2017; originally announced February 2017.

    Comments: Submitted 13-Feb-2017; accepted, after a minor revision, 17-Mar-2017; 11th International Conference on Practical Applications of Computational Biology & Bioinformatics, PACBB 2017, Porto, Portugal, 21-23 June, 2017

    Journal ref: Advances in Intelligent Systems and Computing, Vol 616, 248-256. Springer, 2017

  36. The Minimum Regularized Covariance Determinant estimator

    Authors: Kris Boudt, Peter J. Rousseeuw, Steven Vanduffel, Tim Verdonck

    Abstract: The Minimum Covariance Determinant (MCD) approach robustly estimates the location and scatter matrix using the subset of given size with lowest sample covariance determinant. Its main drawback is that it cannot be applied when the dimension exceeds the subset size. We propose the Minimum Regularized Covariance Determinant (MRCD) approach, which differs from the MCD in that the scatter matrix is a…

    Submitted 1 December, 2018; v1 submitted 24 January, 2017; originally announced January 2017.

    Journal ref: Statistics and Computing, 2020, Vol. 30, 113-128

  37. A Measure of Directional Outlyingness with Applications to Image Data and Video

    Authors: Peter J. Rousseeuw, Jakob Raymaekers, Mia Hubert

    Abstract: Functional data covers a wide range of data types. They all have in common that the observed objects are functions of a univariate argument (e.g. time or wavelength) or a multivariate argument (say, a spatial position). These functions take on values which can in turn be univariate (such as the absorbance level) or multivariate (such as the red/green/blue color levels of an image). In practice…

    Submitted 3 March, 2017; v1 submitted 17 August, 2016; originally announced August 2016.

    Journal ref: Journal of Computational and Graphical Statistics, 2018, Vol. 27, 345-359

  38. arXiv:1601.08133

    stat.ME

    Finding Outliers in Surface Data and Video

    Authors: Mia Hubert, Jakob Raymaekers, Peter J. Rousseeuw, Pieter Segaert

    Abstract: Surface, image and video data can be considered as functional data with a bivariate domain. To detect outlying surfaces or images, a new method is proposed based on the mean and the variability of the degree of outlyingness at each grid point. A rule is constructed to flag the outliers in the resulting functional outlier map. Heatmaps of their outlyingness indicate the regions which are most devia…

    Submitted 29 January, 2016; originally announced January 2016.

  39. Detecting deviating data cells

    Authors: Peter J. Rousseeuw, Wannes Van den Bossche

    Abstract: A multivariate dataset consists of $n$ cases in $d$ dimensions, and is often stored in an $n$ by $d$ data matrix. It is well-known that real data may contain outliers. Depending on the situation, outliers may be (a) undesirable errors which can adversely affect the data analysis, or (b) valuable nuggets of unexpected information. In statistics and data analysis the word outlier usually refers to a…

    Submitted 14 October, 2017; v1 submitted 26 January, 2016; originally announced January 2016.

    Comments: To appear in Technometrics

    Journal ref: Technometrics, 60, 135-145, 2018

  40. arXiv:1508.03828

    stat.ME

    Statistical depth meets computational geometry: a short survey

    Authors: Peter J. Rousseeuw, Mia Hubert

    Abstract: During the past two decades there has been a lot of interest in developing statistical depth notions that generalize the univariate concept of ranking to multivariate data. The notion of depth has also been extended to regression models and functional data. However, computing such depth functions as well as their contours and deepest points is not trivial. Techniques of computational geometry appe…

    Submitted 16 August, 2015; originally announced August 2015.

  41. Multivariate and functional classification using depth and distance

    Authors: Mia Hubert, Peter J. Rousseeuw, Pieter Segaert

    Abstract: We construct classifiers for multivariate and functional data. Our approach is based on a kind of distance between data points and classes. The distance measure needs to be robust to outliers and invariant to linear transformations of the data. For this purpose we can use the bagdistance which is based on halfspace depth. It satisfies most of the properties of a norm but is able to reflect asymmet…

    Submitted 7 July, 2016; v1 submitted 5 April, 2015; originally announced April 2015.

    MSC Class: 62G99

    Journal ref: Advances in Data Analysis and Classification, 2017, Vol. 11, 445-466

  42. High-Breakdown Robust Multivariate Methods

    Authors: Mia Hubert, Peter J. Rousseeuw, Stefan Van Aelst

    Abstract: When applying a statistical method in practice it often occurs that some observations deviate from the usual assumptions. However, many classical methods are sensitive to outliers. The goal of robust statistics is to develop methods that are robust against the possibility that one or several unannounced outliers may occur anywhere in the data. These methods then allow to detect outlying observat…

    Submitted 5 August, 2008; originally announced August 2008.

    Comments: Published at http://dx.doi.org/10.1214/088342307000000087 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Report number: IMS-STS-STS229

    Journal ref: Statistical Science 2008, Vol. 23, No. 1, 92-119