Learning Disease vs Participant Signatures: a permutation test approach to detect identity confounding in machine learning diagnostic applications

Neto, Elias Chaibub; Pratap, Abhishek; Perumal, Thanneer M; Tummalacherla, Meghasyam; Bot, Brian M; Trister, Andrew D; Friend, Stephen H; Mangravite, Lara; Omberg, Larsson

arXiv:1712.03120 (stat)

[Submitted on 8 Dec 2017 (v1), last revised 6 Jul 2018 (this version, v2)]

Abstract: Recently, Saeb et al. (2017) showed that, in diagnostic machine learning applications, having data of each subject randomly assigned to both training and test sets (record-wise data split) can lead to massive underestimation of the cross-validation prediction error, due to the presence of "subject identity confounding" caused by the classifier's ability to identify subjects, instead of recognizing disease. To solve this problem, the authors recommended the random assignment of the data of each subject to either the training or the test set (subject-wise data split). The adoption of subject-wise split has been criticized in Little et al. (2017), on the basis that it can violate assumptions required by cross-validation to consistently estimate generalization error. In particular, adopting subject-wise splitting in heterogeneous datasets might lead to model under-fitting and larger classification errors. Hence, Little et al. argue that perhaps the overestimation of prediction errors with subject-wise cross-validation, rather than underestimation with record-wise cross-validation, is the reason for the discrepancies between prediction error estimates generated by the two splitting strategies. In order to shed light on this controversy, we focus on simpler classification performance metrics and develop permutation tests that can detect identity confounding. By focusing on permutation tests, we are able to evaluate the merits of record-wise and subject-wise data splits under more general statistical dependencies and distributional structures of the data, including situations where cross-validation breaks down. We illustrate the application of our tests using synthetic and real data from a Parkinson's disease study.
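The abstract does not spell out the test procedure, so the following is only a minimal sketch of one way the idea could be probed, not the authors' implementation. It assumes that disease labels are shuffled across subjects as blocks (every record of a subject keeps one shared label), so any accuracy that survives the shuffle can only come from recognizing subjects. The function names, the toy 1-nearest-neighbour classifier, and the synthetic one-dimensional data (features that depend on the subject only, never on disease) are all assumptions made for illustration.

```python
import random

def record_wise_split(n_records, test_fraction, rng):
    """Assign individual records to train/test at random, ignoring subjects."""
    idx = list(range(n_records))
    rng.shuffle(idx)
    n_test = round(test_fraction * n_records)
    return idx[n_test:], idx[:n_test]  # (train, test)

def subject_wise_split(subject_ids, test_fraction, rng):
    """Assign whole subjects to either train or test, never both."""
    subjects = sorted(set(subject_ids))
    rng.shuffle(subjects)
    held_out = set(subjects[:round(test_fraction * len(subjects))])
    train = [i for i, s in enumerate(subject_ids) if s not in held_out]
    test = [i for i, s in enumerate(subject_ids) if s in held_out]
    return train, test

def shuffle_labels_by_subject(subject_ids, labels, rng):
    """Permute labels across subjects; records of one subject share a label."""
    subjects = sorted(set(subject_ids))
    label_of = dict(zip(subject_ids, labels))  # one label per subject
    shuffled = [label_of[s] for s in subjects]
    rng.shuffle(shuffled)
    new_label = dict(zip(subjects, shuffled))
    return [new_label[s] for s in subject_ids]

def nn_accuracy(x, y, train, test):
    """Accuracy of a toy 1-nearest-neighbour classifier on scalar features."""
    hits = 0
    for i in test:
        nearest = min(train, key=lambda j: abs(x[j] - x[i]))
        hits += y[nearest] == y[i]
    return hits / len(test)

def null_mean_accuracy(x, subject_ids, labels, split, n_perm, rng):
    """Mean accuracy over subject-wise label permutations (the null)."""
    accs = []
    for _ in range(n_perm):
        y_perm = shuffle_labels_by_subject(subject_ids, labels, rng)
        train, test = split(rng)
        accs.append(nn_accuracy(x, y_perm, train, test))
    return sum(accs) / len(accs)

# Synthetic data (assumption): 10 subjects, 20 records each. The feature
# encodes a strong subject signature and carries no disease information.
rng = random.Random(0)
subject_ids = [s for s in range(10) for _ in range(20)]
labels = [s % 2 for s in subject_ids]
x = [10.0 * s + rng.uniform(-1.0, 1.0) for s in subject_ids]

rw = null_mean_accuracy(x, subject_ids, labels,
                        lambda r: record_wise_split(len(x), 0.3, r), 50, rng)
sw = null_mean_accuracy(x, subject_ids, labels,
                        lambda r: subject_wise_split(subject_ids, 0.3, r), 50, rng)
print(f"record-wise null mean accuracy:  {rw:.2f}")
print(f"subject-wise null mean accuracy: {sw:.2f}")
```

In this toy setting the labels are meaningless after shuffling, so a null distribution that sits far above chance under the record-wise split indicates that the classifier is matching test records to training records from the same subject. Under the subject-wise split no subject appears on both sides, and that shortcut is unavailable.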

Subjects:	Applications (stat.AP)
Cite as:	arXiv:1712.03120 [stat.AP]
	(or arXiv:1712.03120v2 [stat.AP] for this version)
	https://doi.org/10.48550/arXiv.1712.03120

Submission history

From: Elias Chaibub Neto
[v1] Fri, 8 Dec 2017 15:23:58 UTC (1,137 KB)
[v2] Fri, 6 Jul 2018 23:32:58 UTC (1,137 KB)
