Showing 1–34 of 34 results for author: Verdonck, T

  1. arXiv:2506.04292  [pdf, ps, other]

    cs.SI cs.LG stat.AP

    GARG-AML against Smurfing: A Scalable and Interpretable Graph-Based Framework for Anti-Money Laundering

    Authors: Bruno Deprez, Bart Baesens, Tim Verdonck, Wouter Verbeke

    Abstract: Money laundering poses a significant challenge as it is estimated to account for 2%-5% of the global GDP. This has compelled regulators to impose stringent controls on financial institutions. One prominent laundering method for evading these controls, called smurfing, involves breaking up large transactions into smaller amounts. Given the complexity of smurfing schemes, which involve multiple tran…

    Submitted 4 June, 2025; originally announced June 2025.

  2. arXiv:2505.09425  [pdf, ps, other]

    stat.CO cs.LG

    Independent Component Analysis by Robust Distance Correlation

    Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw, Tom Van Deuren, Tim Verdonck

    Abstract: Independent component analysis (ICA) is a powerful tool for decomposing a multivariate signal or distribution into fully independent sources, not just uncorrelated ones. Unfortunately, most approaches to ICA are not robust against outliers. Here we propose a robust ICA method called RICA, which estimates the components by minimizing a robust measure of dependence between multivariate random variab…

    Submitted 14 May, 2025; originally announced May 2025.

  3. arXiv:2503.24259  [pdf, other]

    cs.LG

    Advances in Continual Graph Learning for Anti-Money Laundering Systems: A Comprehensive Review

    Authors: Bruno Deprez, Wei Wei, Wouter Verbeke, Bart Baesens, Kevin Mets, Tim Verdonck

    Abstract: Financial institutions are required by regulation to report suspicious financial transactions related to money laundering. Therefore, they need to constantly monitor vast amounts of incoming and outgoing transactions. A particular challenge in detecting money laundering is that money launderers continuously adapt their tactics to evade detection. Hence, detection methods need constant fine-tuning.…

    Submitted 31 March, 2025; originally announced March 2025.

  4. arXiv:2502.10185  [pdf, other]

    cs.LG

    A Powerful Random Forest Featuring Linear Extensions (RaFFLE)

    Authors: Jakob Raymaekers, Peter J. Rousseeuw, Thomas Servotte, Tim Verdonck, Ruicong Yao

    Abstract: Random forests are widely used in regression. However, the decision trees used as base learners are poor approximators of linear relationships. To address this limitation we propose RaFFLE (Random Forest Featuring Linear Extensions), a novel framework that integrates the recently developed PILOT trees (Piecewise Linear Organic Trees) as base learners within a random forest ensemble. PILOT trees co…

    Submitted 14 February, 2025; originally announced February 2025.

  5. arXiv:2411.01954  [pdf, other]

    stat.CO stat.ML

    RobPy: a Python Package for Robust Statistical Methods

    Authors: Sarah Leyder, Jakob Raymaekers, Peter J. Rousseeuw, Thomas Servotte, Tim Verdonck

    Abstract: Robust estimation provides essential tools for analyzing data that contain outliers, ensuring that statistical models remain reliable even in the presence of some anomalous data. While robust methods have long been available in R, users of Python have lacked a comprehensive package that offers these methods in a cohesive framework. RobPy addresses this gap by offering a wide range of robust method…

    Submitted 4 November, 2024; originally announced November 2024.

  6. arXiv:2406.10001  [pdf, other]

    cs.CE

    Global Crop-Specific Fertilization Dataset from 1961-2019

    Authors: Fernando Coello, Thomas Decorte, Iris Janssens, Steven Mortier, Jordi Sardans, Josep Peñuelas, Tim Verdonck

    Abstract: As global fertilizer application rates increase, high-quality datasets are paramount for comprehensive analyses to support informed decision-making and policy formulation in crucial areas such as food security or climate change. This study aims to fill existing data gaps by employing two machine learning models, eXtreme Gradient Boosting and HistGradientBoosting algorithms to produce precise count…

    Submitted 11 November, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: 47 pages, 7 figures, 8 tables

  7. arXiv:2406.08206  [pdf, other]

    cs.LG

    Sources of Gain: Decomposing Performance in Conditional Average Dose Response Estimation

    Authors: Christopher Bockel-Rickermann, Toon Vanderschueren, Tim Verdonck, Wouter Verbeke

    Abstract: Estimating conditional average dose responses (CADR) is an important but challenging problem. Estimators must correctly model the potentially complex relationships between covariates, interventions, doses, and outcomes. In recent years, the machine learning community has shown great interest in developing tailored CADR estimators that target specific challenges. Their performance is typically eval…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: 25 pages, 9 figures

    MSC Class: 62D20

  8. arXiv:2405.19383  [pdf, other]

    cs.SI cs.LG

    Network Analytics for Anti-Money Laundering -- A Systematic Literature Review and Experimental Evaluation

    Authors: Bruno Deprez, Toon Vanderschueren, Bart Baesens, Tim Verdonck, Wouter Verbeke

    Abstract: Money laundering presents a pervasive challenge, burdening society by financing illegal activities. The use of network information is increasingly being explored to more effectively combat money laundering, given it involves connected parties. This led to a surge in research on network analytics (NA) for anti-money laundering (AML). The literature on NA for AML is, however, fragmented and a compre…

    Submitted 19 March, 2025; v1 submitted 29 May, 2024; originally announced May 2024.

  9. Inferring the relationship between soil temperature and the normalized difference vegetation index with machine learning

    Authors: Steven Mortier, Amir Hamedpour, Bart Bussmann, Ruth Phoebe Tchana Wandji, Steven Latré, Bjarni D. Sigurdsson, Tom De Schepper, Tim Verdonck

    Abstract: Changes in climate can greatly affect the phenology of plants, which can have important feedback effects, such as altering the carbon cycle. These phenological feedback effects are often induced by a shift in the start or end dates of the growing season of plants. The normalized difference vegetation index (NDVI) serves as a straightforward indicator for assessing the presence of green vegetation…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: 31 pages, 7 figures, 5 tables

  10. arXiv:2312.00090  [pdf, other]

    cs.LG stat.AP

    Tree-based Forecasting of Day-ahead Solar Power Generation from Granular Meteorological Features

    Authors: Nick Berlanger, Noah van Ophoven, Tim Verdonck, Ines Wilms

    Abstract: Accurate forecasts for day-ahead photovoltaic (PV) power generation are crucial to support a high PV penetration rate in the local electricity grid and to assure stability in the grid. We use state-of-the-art tree-based machine learning methods to produce such forecasts and, unlike previous studies, we hereby account for (i) the effects various meteorological as well as astronomical features have…

    Submitted 30 November, 2023; originally announced December 2023.

  11. arXiv:2309.03731  [pdf, other]

    cs.LG stat.ME

    Using representation balancing to learn conditional-average dose responses from clustered data

    Authors: Christopher Bockel-Rickermann, Toon Vanderschueren, Jeroen Berrevoets, Tim Verdonck, Wouter Verbeke

    Abstract: Estimating a unit's responses to interventions with an associated dose, the "conditional average dose response" (CADR), is relevant in a variety of domains, from healthcare to business, economics, and beyond. Such a response typically needs to be estimated from observational data, which introduces several challenges. That is why the machine learning (ML) community has proposed several tailored CAD…

    Submitted 26 July, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

    Comments: 21 pages, 7 figures, v2: updated methodology and experiments

    MSC Class: 62D20

  12. arXiv:2309.03730  [pdf, other]

    cs.LG econ.EM

    A Causal Perspective on Loan Pricing: Investigating the Impacts of Selection Bias on Identifying Bid-Response Functions

    Authors: Christopher Bockel-Rickermann, Sam Verboven, Tim Verdonck, Wouter Verbeke

    Abstract: In lending, where prices are specific to both customers and products, having a well-functioning personalized pricing policy in place is essential to effective business making. Typically, such a policy must be derived from observational data, which introduces several challenges. While the problem of "endogeneity" is prominently studied in the established pricing literature, the problem of selecti…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: 24 pages, 5 figures

  13. arXiv:2308.05422  [pdf, other]

    stat.ME stat.ML

    TSLiNGAM: DirectLiNGAM under heavy tails

    Authors: Sarah Leyder, Jakob Raymaekers, Tim Verdonck

    Abstract: One of the established approaches to causal discovery consists of combining directed acyclic graphs (DAGs) with structural causal models (SCMs) to describe the functional dependencies of effects on their causes. Possible identifiability of SCMs given data depends on assumptions made on the noise variables and the functional classes in the SCM. For instance, in the LiNGAM model, the functional clas…

    Submitted 10 August, 2023; originally announced August 2023.

    Comments: 35 pages, 10 figures

  14. arXiv:2303.05836  [pdf, other]

    stat.ME

    Generalized Spherical Principal Component Analysis

    Authors: Sarah Leyder, Jakob Raymaekers, Tim Verdonck

    Abstract: Outliers contaminating data sets are a challenge to statistical estimators. Even a small fraction of outlying observations can heavily influence most classical statistical methods. In this paper we propose generalized spherical principal component analysis, a new robust version of principal component analysis that is based on the generalized spatial sign covariance matrix. Supporting theoretical p…

    Submitted 10 March, 2023; originally announced March 2023.

  15. arXiv:2302.03931  [pdf, other]

    stat.ML cs.LG stat.ME

    Fast Linear Model Trees by PILOT

    Authors: Jakob Raymaekers, Peter J. Rousseeuw, Tim Verdonck, Ruicong Yao

    Abstract: Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and therefore not scalable to large data sets. In addit…

    Submitted 8 February, 2023; originally announced February 2023.

    Journal ref: Machine Learning, 2024

  16. arXiv:2301.01109  [pdf, other]

    cs.LG econ.EM

    On the causality-preservation capabilities of generative modelling

    Authors: Yves-Cédric Bauwelinckx, Jan Dhaene, Tim Verdonck, Milan van den Heuvel

    Abstract: Modeling lies at the core of both the financial and the insurance industry for a wide variety of tasks. The rise and development of machine learning and deep learning models have created many opportunities to improve our modeling toolbox. Breakthroughs in these fields often come with the requirement of large amounts of data. Such large datasets are often not publicly available in finance and insur…

    Submitted 3 January, 2023; originally announced January 2023.

  17. Fraud Analytics: A Decade of Research -- Organizing Challenges and Solutions in the Field

    Authors: Christopher Bockel-Rickermann, Tim Verdonck, Wouter Verbeke

    Abstract: The literature on fraud analytics and fraud detection has seen a substantial increase in output in the past decade. This has led to a wide range of research topics and overall little organization of the many aspects of fraud analytical research. The focus of academics ranges from identifying fraudulent credit card payments to spotting illegitimate insurance claims. In addition, there is a wide ran…

    Submitted 7 December, 2022; originally announced December 2022.

  18. arXiv:2212.00101  [pdf, other]

    stat.AP econ.EM

    mCube: Multinomial Micro-level reserving Model

    Authors: Emmanuel Jordy Menvouta, Jolien Ponnet, Robin Van Oirbeek, Tim Verdonck

    Abstract: This paper presents a multinomial multi-state micro-level reserving model, denoted mCube. We propose a unified framework for modelling the time and the payment process for IBNR and RBNS claims and for modeling IBNR claim counts. We use multinomial distributions for the time process and spliced mixture models for the payment process. We illustrate the excellent performance of the proposed model on…

    Submitted 30 November, 2022; originally announced December 2022.

    ACM Class: G.3

  19. arXiv:2206.01562  [pdf, other]

    econ.GN cs.LG stat.ML

    Prescriptive maintenance with causal machine learning

    Authors: Toon Vanderschueren, Robert Boute, Tim Verdonck, Bart Baesens, Wouter Verbeke

    Abstract: Machine maintenance is a challenging operational problem, where the goal is to plan sufficient preventive maintenance to avoid machine failures and overhauls. Maintenance is often imperfect in reality and does not make the asset as good as new. Although a variety of imperfect maintenance policies have been proposed in the literature, these rely on strong assumptions regarding the effect of mainten…

    Submitted 3 June, 2022; originally announced June 2022.

  20. arXiv:2202.04369  [pdf, other]

    cs.LG stat.ML

    A new perspective on classification: optimally allocating limited resources to uncertain tasks

    Authors: Toon Vanderschueren, Bart Baesens, Tim Verdonck, Wouter Verbeke

    Abstract: A central problem in business concerns the optimal allocation of limited resources to a set of available tasks, where the payoff of these tasks is inherently uncertain. In credit card fraud detection, for instance, a bank can only assign a small subset of transactions to their fraud investigations team. Typically, such problems are solved using a classification framework, where the focus is on pre…

    Submitted 9 February, 2022; originally announced February 2022.

  21. Noise robustness of persistent homology on greyscale images, across filtrations and signatures

    Authors: Renata Turkeš, Jannes Nys, Tim Verdonck, Steven Latré

    Abstract: Topological data analysis is a recent and fast growing field that approaches the analysis of datasets using techniques from (algebraic) topology. Its main tool, persistent homology (PH), has seen a notable increase in applications in the last decade. Often cited as the most favourable property of PH and the main reason for practical success are the stability theorems that give theoretical results…

    Submitted 17 August, 2021; v1 submitted 16 August, 2021; originally announced August 2021.

    Comments: 24 pages, 7 figures, 4 tables

  22. arXiv:2105.10392  [pdf, ps, other]

    stat.CO stat.ML

    Computational Efficient Approximations of the Concordance Probability in a Big Data Setting

    Authors: Robin Van Oirbeek, Jolien Ponnet, Tim Verdonck

    Abstract: Performance measurement is an essential task once a statistical model is created. The Area Under the receiving operating characteristics Curve (AUC) is the most popular measure for evaluating the quality of a binary classifier. In this case, AUC is equal to the concordance probability, a frequently used measure to evaluate the discriminatory power of the model. Contrary to AUC, the concordance pro…

    Submitted 21 May, 2021; originally announced May 2021.

    Comments: 40 pages, 3 figures

  23. arXiv:2101.01494  [pdf, other]

    stat.ML cs.LG

    Weight-of-evidence 2.0 with shrinkage and spline-binning

    Authors: Jakob Raymaekers, Wouter Verbeke, Tim Verdonck

    Abstract: In many practical applications, such as fraud detection, credit risk modeling or medical decision making, classification models for assigning instances to a predefined set of classes are required to be both precise as well as interpretable. Linear modeling methods such as logistic regression are often adopted, since they offer an acceptable balance between precision and interpretability. Linear me…

    Submitted 24 September, 2021; v1 submitted 5 January, 2021; originally announced January 2021.

    Comments: New version: duplicate paragraph omitted

  24. arXiv:2012.06893  [pdf, other]

    stat.ME

    Sparse dimension reduction based on energy and ball statistics

    Authors: Emmanuel Jordy Menvouta, Sven Serneels, Tim Verdonck

    Abstract: As its name suggests, sufficient dimension reduction (SDR) targets to estimate a subspace from data that contains all information sufficient to explain a dependent variable. Ample approaches exist to SDR, some of the most recent of which rely on minimal to no model assumptions. These are defined according to an optimization criterion that maximizes a nonparametric measure of association. The origi…

    Submitted 12 December, 2020; originally announced December 2020.

    MSC Class: 62G05; 62H12

  25. arXiv:2006.01635  [pdf, other]

    stat.CO

    direpack: A Python 3 package for state-of-the-art statistical dimension reduction methods

    Authors: Emmanuel Jordy Menvouta, Sven Serneels, Tim Verdonck

    Abstract: The direpack package aims to establish a set of modern statistical dimension reduction techniques into the Python universe as a single, consistent package. The dimension reduction methods included resort into three categories: projection pursuit based dimension reduction, sufficient dimension reduction, and robust M estimators for dimension reduction. As a corollary, regularized regression estimat…

    Submitted 30 May, 2020; originally announced June 2020.

    MSC Class: 62H20; 62H12; 62H25; 62P99

  26. arXiv:2005.02488  [pdf, other]

    stat.AP

    Instance-Dependent Cost-Sensitive Learning for Detecting Transfer Fraud

    Authors: Sebastiaan Höppner, Bart Baesens, Wouter Verbeke, Tim Verdonck

    Abstract: Card transaction fraud is a growing problem affecting card holders worldwide. Financial institutions increasingly rely upon data-driven methods for developing fraud detection systems, which are able to automatically detect and block fraudulent transactions. From a machine learning perspective, the task of detecting fraudulent transactions is a binary classification problem. Classification models a…

    Submitted 5 May, 2020; originally announced May 2020.

    Comments: 24 pages, 4 figures, submitted

  27. arXiv:2003.11915  [pdf, other]

    cs.LG cs.CR stat.AP stat.ML

    robROSE: A robust approach for dealing with imbalanced data in fraud detection

    Authors: Bart Baesens, Sebastiaan Höppner, Irene Ortner, Tim Verdonck

    Abstract: A major challenge when trying to detect fraud is that the fraudulent activities form a minority class which make up a very small proportion of the data set. In most data sets, fraud occurs in typically less than 0.5% of the cases. Detecting fraud in such a highly imbalanced data set typically leads to predictions that favor the majority group, causing fraud to remain undetected. We discuss some po…

    Submitted 22 March, 2020; originally announced March 2020.

  28. arXiv:1912.03407  [pdf, other]

    stat.ME

    Cellwise Robust M Regression

    Authors: Peter Filzmoser, Sebastiaan Höppner, Irene Ortner, Sven Serneels, Tim Verdonck

    Abstract: The cellwise robust M regression estimator is introduced as the first estimator of its kind that intrinsically yields both a map of cellwise outliers consistent with the linear model, and a vector of regression coefficients that is robust against vertical outliers and leverage points. As a by-product, the method yields a weighted and imputed data set that contains estimates of what the values in c…

    Submitted 16 March, 2020; v1 submitted 6 December, 2019; originally announced December 2019.

    Journal ref: Computational Statistics and Data Analysis, 147 (2020), 106944

  29. arXiv:1911.06187  [pdf]

    math.AP stat.ML

    Concordance probability in a big data setting: application in non-life insurance

    Authors: Robin Van Oirbeek, Christopher Grumiau, Tim Verdonck

    Abstract: The concordance probability or C-index is a popular measure to capture the discriminatory ability of a regression model. In this article, the definition of this measure is adapted to the specific needs of the frequency and severity model, typically used during the technical pricing of a non-life insurance product. Due to the typical large sample size of the frequency data in particular, two differ…

    Submitted 14 November, 2019; originally announced November 2019.

  30. arXiv:1806.09803  [pdf, ps, other]

    stat.AP

    Multivariate Constrained Robust M-Regression for Shaping Forward Curves in Electricity Markets

    Authors: Peter Leoni, Pieter Segaert, Sven Serneels, Tim Verdonck

    Abstract: In this paper, a multivariate constrained robust M-regression (MCRM) method is developed to estimate shaping coefficients for electricity forward prices. An important benefit of the new method is that model arbitrage can be ruled out at an elementary level, as all shaping coefficients are treated simultaneously. Moreover, the new method is robust to outliers, such that the provided results are sta…

    Submitted 26 June, 2018; originally announced June 2018.

  31. arXiv:1712.08101  [pdf, other]

    stat.ML cs.LG stat.AP

    Profit Driven Decision Trees for Churn Prediction

    Authors: Sebastiaan Höppner, Eugen Stripling, Bart Baesens, Seppe vanden Broucke, Tim Verdonck

    Abstract: Customer retention campaigns increasingly rely on predictive models to detect potential churners in a vast customer base. From the perspective of machine learning, the task of predicting customer churn can be presented as a binary classification problem. Using data on historic behavior, classification algorithms are built with the purpose of accurately predicting the probability of a customer defe…

    Submitted 21 December, 2017; originally announced December 2017.

  32. Outlyingness: why do outliers lie out?

    Authors: Michiel Debruyne, Sebastiaan Höppner, Sven Serneels, Tim Verdonck

    Abstract: Outlier detection is an inevitable step to most statistical data analyses. However, the mere detection of an outlying case does not always answer all scientific questions associated with that data point. Outlier detection techniques, classical and robust alike, will typically flag the entire case as outlying, or attribute a specific case weight to the entire case. In practice, particularly in high…

    Submitted 12 August, 2017; originally announced August 2017.

  33. The Minimum Regularized Covariance Determinant estimator

    Authors: Kris Boudt, Peter J. Rousseeuw, Steven Vanduffel, Tim Verdonck

    Abstract: The Minimum Covariance Determinant (MCD) approach robustly estimates the location and scatter matrix using the subset of given size with lowest sample covariance determinant. Its main drawback is that it cannot be applied when the dimension exceeds the subset size. We propose the Minimum Regularized Covariance Determinant (MRCD) approach, which differs from the MCD in that the scatter matrix is a…

    Submitted 1 December, 2018; v1 submitted 24 January, 2017; originally announced January 2017.

    Journal ref: Statistics and Computing, 2020, Vol. 30, 113-128

  34. Robust bootstrap procedures for the chain-ladder method

    Authors: Kris Peremans, Pieter Segaert, Stefan Van Aelst, Tim Verdonck

    Abstract: Insurers are faced with the challenge of estimating the future reserves needed to handle historic and outstanding claims that are not fully settled. A well-known and widely used technique is the chain-ladder method, which is a deterministic algorithm. To include a stochastic component one may apply generalized linear models to the run-off triangles based on past claims data. Analytical expressions…

    Submitted 14 January, 2017; originally announced January 2017.