-
Learning from data with structured missingness
Authors:
Robin Mitra,
Sarah F. McGough,
Tapabrata Chakraborti,
Chris Holmes,
Ryan Copping,
Niels Hagenbuch,
Stefanie Biedermann,
Jack Noonan,
Brieuc Lehmann,
Aditi Shenvi,
Xuan Vinh Doan,
David Leslie,
Ginestra Bianconi,
Ruben Sanchez-Garcia,
Alisha Davies,
Maxine Mackintosh,
Eleni-Rosalina Andrinopoulou,
Anahid Basiri,
Chris Harbron,
Ben D. MacArthur
Abstract:
Missing data are an unavoidable complication in many machine learning tasks. When data are 'missing at random' there exists a range of tools and techniques to deal with the issue. However, as machine learning studies become more ambitious and seek to learn from ever-larger volumes of heterogeneous data, an increasingly encountered problem arises in which missing values exhibit an association or structure, either explicitly or implicitly. Such 'structured missingness' raises a range of challenges that have not yet been systematically addressed, and presents a fundamental hindrance to machine learning at scale. Here, we outline the current literature and propose a set of grand challenges in learning from data with structured missingness.
Submitted 3 April, 2023;
originally announced April 2023.
-
An integrated approach to test for missing not at random
Authors:
Jack Noonan,
Adetola Adedamola Adediran,
Robin Mitra,
Stefanie Biedermann
Abstract:
Missing data can lead to inefficiencies and biases in analyses, in particular when data are missing not at random (MNAR). It is thus vital to understand and correctly identify the missing data mechanism. Recovering missing values through a follow-up sample allows researchers to conduct hypothesis tests for MNAR, which are not possible when using only the original incomplete data. How the properties of these tests are affected by the follow-up sample design is little explored in the literature. Our results provide comprehensive insight into the properties of one such test, based on the commonly used selection model framework. We determine conditions for recovery samples that allow the test to be applied appropriately and effectively, i.e. with known Type I error rates and optimized with respect to power. We thus provide an integrated framework for testing for the presence of MNAR and designing follow-up samples in an efficient, cost-effective way. The performance of our methodology is evaluated through simulation studies as well as on a real data sample.
Submitted 7 December, 2022; v1 submitted 15 August, 2022;
originally announced August 2022.
-
Memetic Graph Clustering
Authors:
Sonja Biedermann,
Monika Henzinger,
Christian Schulz,
Bernhard Schuster
Abstract:
It is common knowledge that there is no single best strategy for graph clustering, which justifies a plethora of existing approaches. In this paper, we present a general memetic algorithm, VieClus, to tackle the graph clustering problem. This algorithm can be adapted to optimize different objective functions. Key components of our contribution are natural recombine operators that employ ensemble clusterings as well as multi-level techniques. Lastly, we combine these techniques with a scalable communication protocol, producing a system that is able to compute high-quality solutions in a short amount of time. We instantiate our scheme with local search for modularity and show that our algorithm successfully improves or reproduces all entries of the 10th DIMACS implementation challenge under consideration using a small amount of time.
Submitted 20 February, 2018;
originally announced February 2018.