Privacy-Aware, Public-Aligned: Embedding Risk Detection and Public Values into Scalable Clinical Text De-Identification for Trusted Research Environments

Casey, Arlene; Dunbar, Stuart; Gruber, Franz; McInerney, Samuel; Falis, Matúš; Linksted, Pamela; Wilde, Katie; Harrison, Kathy; Hamilton, Alison; Cole, Christian

arXiv:2506.02063 (cs)

[Submitted on 1 Jun 2025]

Abstract: Clinical free-text data offer immense potential to improve population health research through richer phenotyping, symptom tracking, and contextual understanding of patient care. However, these data present significant privacy risks because directly or indirectly identifying information is embedded in unstructured narratives. While numerous de-identification tools have been developed, few have been tested on real-world, heterogeneous datasets at scale or assessed for governance readiness. In this paper, we synthesise our findings from previous studies examining the privacy-risk landscape across multiple document types and NHS data providers in Scotland. We characterise how direct and indirect identifiers vary by record type, clinical setting, and data flow, and show how changes in documentation practice can degrade model performance over time. Through public engagement, we explore societal expectations around the safe use of clinical free text and reflect these in the design of a prototype privacy-risk management tool to support transparent, auditable decision-making. Our findings highlight that privacy risk is context-dependent and cumulative, underscoring the need for adaptable, hybrid de-identification approaches that combine rule-based precision with contextual understanding. We offer a comprehensive view of the challenges and opportunities for safe, scalable reuse of clinical free text within Trusted Research Environments and beyond, grounded in both technical evidence and public perspectives on responsible data use.

Subjects:	Cryptography and Security (cs.CR)
Cite as:	arXiv:2506.02063 [cs.CR]
	(or arXiv:2506.02063v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2506.02063

Submission history

From: Arlene Casey J
[v1] Sun, 1 Jun 2025 17:45:57 UTC (564 KB)
