-
Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit
Authors:
Zeljko Kraljevic,
Thomas Searle,
Anthony Shek,
Lukasz Roguski,
Kawsar Noor,
Daniel Bean,
Aurelie Mascio,
Leilei Zhu,
Amos A Folarin,
Angus Roberts,
Rebecca Bendayan,
Mark P Richardson,
Robert Stewart,
Anoop D Shah,
Wai Keong Wong,
Zina Ibrahim,
James T Teo,
Richard JB Dobson
Abstract:
Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of Information Extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT) that provides: a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; b) a feature-rich annotation interface for customising and training IE models; and c) integrations with the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician-annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets, and concept types, indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
Submitted 25 March, 2021; v1 submitted 2 October, 2020;
originally announced October 2020.
-
Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research
Authors:
Benjamin Birnbaum,
Nathan Nussbaum,
Katharina Seidl-Rathkopf,
Monica Agrawal,
Melissa Estevez,
Evan Estola,
Joshua Haimson,
Lucy He,
Peter Larson,
Paul Richardson
Abstract:
Objective: Electronic health records (EHRs) are a promising source of data for health outcomes research in oncology. A challenge in using EHR data is that selecting cohorts of patients often requires information in unstructured parts of the record. Machine learning has been used to address this, but even high-performing algorithms may select patients in a non-random manner and bias the resulting cohort. To improve the efficiency of cohort selection while measuring potential bias, we introduce a technique called Model-Assisted Cohort Selection (MACS) with Bias Analysis and apply it to the selection of metastatic breast cancer (mBC) patients. Materials and Methods: We trained a model on 17,263 patients using term-frequency inverse-document-frequency (TF-IDF) and logistic regression. We used a test set of 17,292 patients to measure algorithm performance and perform Bias Analysis. We compared the cohort generated by MACS to the cohort that would have been generated without MACS as reference standard, first by comparing distributions of an extensive set of clinical and demographic variables and then by comparing the results of two analyses addressing existing example research questions. Results: Our algorithm had an area under the curve (AUC) of 0.976, a sensitivity of 96.0%, and an abstraction efficiency gain of 77.9%. During Bias Analysis, we found no large differences in baseline characteristics and no differences in the example analyses. Conclusion: MACS with bias analysis can significantly improve the efficiency of cohort selection on EHR data while instilling confidence that outcomes research performed on the resulting cohort will not be biased.
Submitted 13 January, 2020;
originally announced January 2020.
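The abstract above reports an AUC and a sensitivity for its classifier. As a rough illustration of how these two metrics are commonly computed from model scores, here is a minimal sketch in plain Python. The function names, the toy labels and scores, and the 0.5 threshold are all hypothetical and are not taken from the paper; the paper's own model (TF-IDF with logistic regression) and its data are not reproduced here.

```python
def sensitivity(labels, scores, threshold):
    """Fraction of true positives that score at or above the threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return tp / (tp + fn)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = patient belongs in the cohort, 0 = does not.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.4, 0.2]

print(sensitivity(toy_labels, toy_scores, threshold=0.5))
print(auc(toy_labels, toy_scores))
```

The pairwise form of the AUC is quadratic in the number of examples, which is fine for a toy illustration; a rank-based computation would be the usual choice at the scale of the test set described in the abstract.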
-
Counting fixed points and rooted closed walks of the singular map $x \mapsto x^{x^n}$ modulo powers of a prime
Authors:
Joshua Holden,
Pamela A. Richardson,
Margaret M. Robinson
Abstract:
The "self-power" map $x \mapsto x^x$ modulo $m$ and its generalized form $x \mapsto x^{x^n}$ modulo $m$ are of considerable interest both for theoretical reasons and for potential applications to cryptography. In this paper, we use $p$-adic methods, primarily $p$-adic interpolation, Hensel's lemma, and lifting singular points modulo $p$, to count fixed points and rooted closed walks of equations related to these maps when $m$ is a prime power. In particular, we introduce a new technique for lifting singular solutions of several congruences in several unknowns using the left kernel of the Jacobian matrix.
Submitted 26 May, 2020; v1 submitted 21 September, 2016;
originally announced September 2016.
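To make the object being counted in the abstract above concrete, the following is a brute-force sketch that lists the solutions of $x^{x^n} \equiv x \pmod{p^e}$ for small parameters. It does not use the paper's $p$-adic techniques and does not address rooted closed walks. The function name and the search range $1 \le x \le p^e(p-1)$ are assumptions made for illustration; the paper may restrict $x$ differently (for example, to values not divisible by $p$).

```python
def fixed_points(p, e, n=1):
    """List x with x^(x^n) congruent to x modulo p^e, by exhaustive search.

    Assumption (not from the paper): x ranges over 1 .. p^e * (p - 1).
    """
    m = p ** e
    search_limit = m * (p - 1)
    return [
        x
        for x in range(1, search_limit + 1)
        # Three-argument pow performs modular exponentiation efficiently.
        if pow(x, x ** n, m) == x % m
    ]


print(fixed_points(3, 1))        # self-power map x -> x^x modulo 3
print(len(fixed_points(3, 2)))   # number of solutions modulo 9
```

Exhaustive search is only practical for very small $p^e$, since the exponent $x^n$ is computed as an exact integer before the modular reduction.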