-
RelCAT: Advancing Extraction of Clinical Inter-Entity Relationships from Unstructured Electronic Health Records
Authors:
Shubham Agarwal,
Vlad Dinu,
Thomas Searle,
Mart Ratas,
Anthony Shek,
Dan F. Stein,
James Teo,
Richard Dobson
Abstract:
This study introduces RelCAT (Relation Concept Annotation Toolkit), an interactive tool, library, and workflow designed to classify relations between entities extracted from clinical narratives. Building upon the CogStack MedCAT framework, RelCAT addresses the challenge of capturing complete clinical relations dispersed within text. The toolkit implements state-of-the-art machine learning models such as BERT and Llama along with proven evaluation and training methods. We demonstrate a dataset annotation tool (built within MedCATTrainer), model training, and evaluate our methodology on both openly available gold-standard and real-world UK National Health Service (NHS) hospital clinical datasets. We perform extensive experimentation and a comparative analysis of the various publicly available models with varied approaches selected for model fine-tuning. Finally, we achieve macro F1-scores of 0.977 on the gold-standard n2c2, surpassing the previous state-of-the-art performance, and achieve performance of >=0.93 F1 on our NHS gathered datasets.
Submitted 27 January, 2025;
originally announced January 2025.
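The macro F1-score reported in the RelCAT abstract above is the unweighted mean of per-class F1 scores, so rare relation classes count as much as frequent ones. A minimal sketch of that computation follows; the function name and the relation labels in the usage note are hypothetical and are not taken from RelCAT or MedCAT.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, `macro_f1(["treats", "treats", "causes", "causes"], ["treats", "causes", "causes", "causes"])` averages an F1 of 2/3 for the first class and 4/5 for the second.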
-
Large Language Models for Medical Forecasting -- Foresight 2
Authors:
Zeljko Kraljevic,
Joshua Au Yeung,
Daniel Bean,
James Teo,
Richard J. Dobson
Abstract:
Foresight 2 (FS2) is a large language model fine-tuned on hospital data for modelling patient timelines (GitHub 'removed for anon'). It can understand patients' clinical notes and predict SNOMED codes for a wide range of biomedical use cases, including diagnosis suggestions, risk forecasting, and procedure and medication recommendations. FS2 is trained on the free text portion of the MIMIC-III dataset, firstly through extracting biomedical concepts and then creating contextualised patient timelines, upon which the model is then fine-tuned. The results show significant improvement over the previous state-of-the-art for the next new biomedical concept prediction (P/R - 0.73/0.66 vs 0.52/0.32) and a similar improvement specifically for the next new disorder prediction (P/R - 0.69/0.62 vs 0.46/0.25). Finally, on the task of risk forecast, we compare our model to GPT-4-turbo (and a range of open-source biomedical LLMs) and show that FS2 performs significantly better on such tasks (P@5 - 0.90 vs 0.65). This highlights the need to incorporate hospital data into LLMs and shows that small models outperform much larger ones when fine-tuned on high-quality, specialised data.
Submitted 14 December, 2024;
originally announced December 2024.
-
Unboxing Virgil ADTs for Fun and Profit
Authors:
Bradley Wei Jie Teo,
Ben L. Titzer
Abstract:
Algebraic Data Types (ADTs) are an increasingly common feature in modern programming languages. In many implementations, values of non-nullary, multi-case ADTs are allocated on the heap, which may reduce performance and increase memory usage. This work explores annotation-guided optimizations to ADT representation in Virgil, a systems-level programming language that compiles to x86, x86-64, Wasm and the Java Virtual Machine.
We extend Virgil with annotations: #unboxed to eliminate the overhead of heap allocation via automatic compiler transformation to a scalar representation, and #packed, to enable programmer-expressed bit-layouts. These annotations allow programmers to both save memory and manipulate data in formats dictated by hardware. We dedicate this work as an homage and echo of work done in collaboration with Jens in the work entitled "A Declarative Approach to Generating Machine Code Tools", an unpublished manuscript from 2005. In fact, this work inherits some syntactic conventions from that prior work. The performance impact of these representation changes was evaluated on a variety of workloads in terms of execution time and memory usage, but we don't include it because Jens likes semantics and type systems better!
Submitted 14 October, 2024;
originally announced October 2024.
-
Improving Extraction of Clinical Event Contextual Properties from Electronic Health Records: A Comparative Study
Authors:
Shubham Agarwal,
Thomas Searle,
Mart Ratas,
Anthony Shek,
James Teo,
Richard Dobson
Abstract:
Electronic Health Records are large repositories of valuable clinical data, with a significant portion stored in unstructured text format. This textual data includes clinical events (e.g., disorders, symptoms, findings, medications and procedures) in context that if extracted accurately at scale can unlock valuable downstream applications such as disease prediction. Using an existing Named Entity Recognition and Linking methodology, MedCAT, these identified concepts need to be further classified (contextualised) for their relevance to the patient, and their temporal and negated status for example, to be useful downstream. This study performs a comparative analysis of various natural language models for medical text classification. Extensive experimentation reveals the effectiveness of transformer-based language models, particularly BERT. When combined with class imbalance mitigation techniques, BERT outperforms Bi-LSTM models by up to 28% and the baseline BERT model by up to 16% for recall of the minority classes. The method has been implemented as part of CogStack/MedCAT framework and made available to the community for further research.
Submitted 30 August, 2024;
originally announced August 2024.
-
DHR+S: Distributed Hybrid Rendering with Realistic Real-time Shadows for Interactive Thin Client Metaverse and Game Applications
Authors:
Yu Wei Tan,
Siang Ern Low,
Jonas Chow,
Javon Teo,
Anand Bhojan
Abstract:
Distributed hybrid rendering (DHR) is a real-time rendering approach that incorporates cloud-based ray tracing with locally rasterized graphics for interactive thin client metaverse and game applications. With cloud assistance, DHR can generate high-fidelity ray-traced graphics contents remotely and deliver them to thin clients with low graphics capability, including standalone extended reality devices and mobile phones, while maintaining interactive frame rates for users under adverse network conditions. DHR can already achieve the effect of ray-traced hard shadows that form with the occlusion of direct illumination. We enhance the realism of these shadows by softening their edges with the direction of rays traced and approximating the occlusion of indirect illumination by reconstructing ray-traced ambient occlusion with a modified version of spatiotemporal variance-guided filtering. Our technique uses only 20-30% of the bandwidth of remote rendering and is also tolerant of delays of up to 200 ms with only slight distortion to the shadows along object edges.
Submitted 11 June, 2024;
originally announced June 2024.
-
Framework to generate perfusion map from CT and CTA images in patients with acute ischemic stroke: A longitudinal and cross-sectional study
Authors:
Chayanin Tangwiriyasakul,
Pedro Borges,
Stefano Moriconi,
Paul Wright,
Yee-Haur Mah,
James Teo,
Parashkev Nachev,
Sebastien Ourselin,
M. Jorge Cardoso
Abstract:
Stroke is a leading cause of disability and death. Effective treatment decisions require early and informative vascular imaging. 4D perfusion imaging is ideal but rarely available within the first hour after stroke, whereas plain CT and CTA usually are. Hence, we propose a framework to extract a predicted perfusion map (PPM) derived from CT and CTA images. In all eighteen patients, we found significantly high spatial similarity (with average Spearman's correlation = 0.7893) between our predicted perfusion map (PPM) and the T-max map derived from 4D-CTP. Voxelwise correlations between the PPM and National Institutes of Health Stroke Scale (NIHSS) subscores for L/R hand motor, gaze, and language on a large cohort of 2,110 subjects reliably mapped symptoms to expected infarct locations. Therefore our PPM could serve as an alternative for 4D perfusion imaging, if the latter is unavailable, to investigate blood perfusion in the first hours after hospital admission.
Submitted 5 April, 2024;
originally announced April 2024.
-
Flexible Non-intrusive Dynamic Instrumentation for WebAssembly
Authors:
Ben L. Titzer,
Elizabeth Gilbert,
Bradley Wei Jie Teo,
Yash Anand,
Kazuyuki Takayama,
Heather Miller
Abstract:
A key strength of managed runtimes over hardware is the ability to gain detailed insight into the dynamic execution of programs with instrumentation. Analyses such as code coverage, execution frequency, tracing, and debugging, are all made easier in a virtual setting. As a portable, low-level bytecode, WebAssembly offers inexpensive in-process sandboxing with high performance. Yet to date, Wasm engines have not offered much insight into executing programs, supporting at best bytecode-level stepping and basic source maps, but no instrumentation capabilities. In this paper, we show the first non-intrusive dynamic instrumentation system for WebAssembly in the open-source Wizard Research Engine. Our innovative design offers a flexible, complete hierarchy of instrumentation primitives that support building high-level, complex analyses in terms of low-level, programmable probes. In contrast to emulation or machine code instrumentation, injecting probes at the bytecode level increases expressiveness and vastly simplifies the implementation by reusing the engine's JIT compiler, interpreter, and deoptimization mechanism rather than building new ones. Wizard supports both dynamic instrumentation insertion and removal while providing consistency guarantees, which is key to composing multiple analyses without interference. We detail a fully-featured implementation in a high-performance multi-tier Wasm engine, show novel optimizations specifically designed to minimize instrumentation overhead, and evaluate performance characteristics under load from various analyses. This design is well-suited for production engine adoption as probes can be implemented to have no impact on production performance when not in use.
Submitted 12 March, 2024;
originally announced March 2024.
-
Validating transformers for redaction of text from electronic health records in real-world healthcare
Authors:
Zeljko Kraljevic,
Anthony Shek,
Joshua Au Yeung,
Ewart Jonathan Sheldon,
Mohammad Al-Agil,
Haris Shuaib,
Xi Bai,
Kawsar Noor,
Anoop D. Shah,
Richard Dobson,
James Teo
Abstract:
Protecting patient privacy in healthcare records is a top priority, and redaction is a commonly used method for obscuring directly identifiable information in text. Rule-based methods have been widely used, but their precision is often low causing over-redaction of text and frequently not being adaptable enough for non-standardised or unconventional structures of personal health information. Deep learning techniques have emerged as a promising solution, but implementing them in real-world environments poses challenges due to the differences in patient record structure and language across different departments, hospitals, and countries.
In this study, we present AnonCAT, a transformer-based model and a blueprint on how deidentification models can be deployed in real-world healthcare. AnonCAT was trained through a process involving manually annotated redactions of real-world documents from three UK hospitals with different electronic health record systems and 3116 documents. The model achieved high performance in all three hospitals with a Recall of 0.99, 0.99 and 0.96.
Our findings demonstrate the potential of deep learning techniques for improving the efficiency and accuracy of redaction in global healthcare data and highlight the importance of building workflows which not only use these models but are also able to continually fine-tune and audit the performance of these algorithms to ensure continuing effectiveness in real-world settings. This approach provides a blueprint for the real-world use of de-identifying algorithms through fine-tuning and localisation. The code, together with tutorials, is available on GitHub (https://github.com/CogStack/MedCAT).
Submitted 5 October, 2023;
originally announced October 2023.
-
Unsupervised 3D out-of-distribution detection with latent diffusion models
Authors:
Mark S. Graham,
Walter Hugo Lopez Pinaya,
Paul Wright,
Petru-Daniel Tudosiu,
Yee H. Mah,
James T. Teo,
H. Rolf Jäger,
David Werring,
Parashkev Nachev,
Sebastien Ourselin,
M. Jorge Cardoso
Abstract:
Methods for out-of-distribution (OOD) detection that scale to 3D data are crucial components of any real-world clinical deep learning system. Classic denoising diffusion probabilistic models (DDPMs) have been recently proposed as a robust way to perform reconstruction-based OOD detection on 2D datasets, but do not trivially scale to 3D data. In this work, we propose to use Latent Diffusion Models (LDMs), which enable the scaling of DDPMs to high-resolution 3D medical data. We validate the proposed approach on near- and far-OOD datasets and compare it to a recently proposed, 3D-enabled approach using Latent Transformer Models (LTMs). Not only does the proposed LDM-based approach achieve statistically significant better performance, it also shows less sensitivity to the underlying latent representation, more favourable memory scaling, and produces better spatial anomaly maps. Code is available at https://github.com/marksgraham/ddpm-ood
Submitted 7 July, 2023;
originally announced July 2023.
-
Foresight -- Generative Pretrained Transformer (GPT) for Modelling of Patient Timelines using EHRs
Authors:
Zeljko Kraljevic,
Dan Bean,
Anthony Shek,
Rebecca Bendayan,
Harry Hemingway,
Joshua Au Yeung,
Alexander Deng,
Alfie Baston,
Jack Ross,
Esther Idowu,
James T Teo,
Richard J Dobson
Abstract:
Background: Electronic Health Records hold detailed longitudinal information about each patient's health status and general clinical history, a large portion of which is stored within the unstructured text. Existing approaches focus mostly on structured data and a subset of single-domain outcomes. We explore how temporal modelling of patients from free text and structured data, using deep generative transformers can be used to forecast a wide range of future disorders, substances, procedures or findings. Methods: We present Foresight, a novel transformer-based pipeline that uses named entity recognition and linking tools to convert document text into structured, coded concepts, followed by providing probabilistic forecasts for future medical events such as disorders, substances, procedures and findings. We processed the entire free-text portion from three different hospital datasets totalling 811336 patients covering both physical and mental health. Findings: On tests in two UK hospitals (King's College Hospital, South London and Maudsley) and the US MIMIC-III dataset precision@10 0.68, 0.76 and 0.88 was achieved for forecasting the next disorder in a patient timeline, while precision@10 of 0.80, 0.81 and 0.91 was achieved for forecasting the next biomedical concept. Foresight was also validated on 34 synthetic patient timelines by five clinicians and achieved relevancy of 97% for the top forecasted candidate disorder. As a generative model, it can forecast follow-on biomedical concepts for as many steps as required. Interpretation: Foresight is a general-purpose model for biomedical concept modelling that can be used for real-world risk forecasting, virtual trials and clinical research to study the progression of disorders, simulate interventions and counterfactuals, and educational purposes.
Submitted 24 January, 2023; v1 submitted 13 December, 2022;
originally announced December 2022.
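The precision@10 figures in the Foresight abstract above describe how often a forecast is counted as correct when the model proposes a ranked list of candidate concepts. The sketch below assumes one common reading of such a metric: a case counts as a hit if the true next concept appears among the top k candidates. This reading, the function name, and the placeholder concept identifiers are assumptions for illustration; the abstract does not spell out the exact definition the authors use.

```python
def mean_hit_at_k(cases, k):
    """Fraction of cases whose true next concept is within the top-k candidates.

    `cases` is a list of (ranked_candidates, true_next_concept) pairs, where
    ranked_candidates is ordered from most to least likely.
    """
    if not cases:
        return 0.0
    hits = sum(1 for ranked, truth in cases if truth in ranked[:k])
    return hits / len(cases)


# Hypothetical ranked forecasts with placeholder concept identifiers.
example_cases = [
    (["c1", "c2", "c3"], "c2"),  # hit at k >= 2
    (["c4", "c5", "c6"], "c9"),  # never a hit
    (["c7", "c8"], "c7"),        # hit at k >= 1
]
```

Under this reading the score can only stay level or rise as k grows, which matches the abstract's top-10 scores being reported per task rather than per rank.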
-
Discharge Summary Hospital Course Summarisation of In Patient Electronic Health Record Text with Clinical Concept Guided Deep Pre-Trained Transformer Models
Authors:
Thomas Searle,
Zina Ibrahim,
James Teo,
Richard Dobson
Abstract:
Brief Hospital Course (BHC) summaries are succinct summaries of an entire hospital encounter, embedded within discharge summaries, written by senior clinicians responsible for the overall care of a patient. Methods to automatically produce summaries from inpatient documentation would be invaluable in reducing clinician manual burden of summarising documents under high time-pressure to admit and discharge patients. Automatically producing these summaries from the inpatient course, is a complex, multi-document summarisation task, as source notes are written from various perspectives (e.g. nursing, doctor, radiology), during the course of the hospitalisation. We demonstrate a range of methods for BHC summarisation demonstrating the performance of deep learning summarisation models across extractive and abstractive summarisation scenarios. We also test a novel ensemble extractive and abstractive summarisation model that incorporates a medical concept ontology (SNOMED) as a clinical guidance signal and shows superior performance in 2 real-world clinical data sets.
Submitted 10 April, 2023; v1 submitted 14 November, 2022;
originally announced November 2022.
-
Fast Unsupervised Brain Anomaly Detection and Segmentation with Diffusion Models
Authors:
Walter H. L. Pinaya,
Mark S. Graham,
Robert Gray,
Pedro F Da Costa,
Petru-Daniel Tudosiu,
Paul Wright,
Yee H. Mah,
Andrew D. MacKinnon,
James T. Teo,
Rolf Jager,
David Werring,
Geraint Rees,
Parashkev Nachev,
Sebastien Ourselin,
M. Jorge Cardoso
Abstract:
Deep generative models have emerged as promising tools for detecting arbitrary anomalies in data, dispensing with the necessity for manual labelling. Recently, autoregressive transformers have achieved state-of-the-art performance for anomaly detection in medical imaging. Nonetheless, these models still have some intrinsic weaknesses, such as requiring images to be modelled as 1D sequences, the accumulation of errors during the sampling process, and the significant inference times associated with transformers. Denoising diffusion probabilistic models are a class of non-autoregressive generative models recently shown to produce excellent samples in computer vision (surpassing Generative Adversarial Networks), and to achieve log-likelihoods that are competitive with transformers while having fast inference times. Diffusion models can be applied to the latent representations learnt by autoencoders, making them easily scalable and great candidates for application to high dimensional data, such as medical images. Here, we propose a method based on diffusion models to detect and segment anomalies in brain imaging. By training the models on healthy data and then exploring its diffusion and reverse steps across its Markov chain, we can identify anomalous areas in the latent space and hence identify anomalies in the pixel space. Our diffusion models achieve competitive performance compared with autoregressive approaches across a series of experiments with 2D CT and MRI data involving synthetic and real pathological lesions with much reduced inference times, making their usage clinically viable.
Submitted 7 June, 2022;
originally announced June 2022.
-
Transformer-based out-of-distribution detection for clinically safe segmentation
Authors:
Mark S Graham,
Petru-Daniel Tudosiu,
Paul Wright,
Walter Hugo Lopez Pinaya,
U Jean-Marie,
Yee Mah,
James Teo,
Rolf H Jäger,
David Werring,
Parashkev Nachev,
Sebastien Ourselin,
M Jorge Cardoso
Abstract:
In a clinical setting it is essential that deployed image processing systems are robust to the full range of inputs they might encounter and, in particular, do not make confidently wrong predictions. The most popular approach to safe processing is to train networks that can provide a measure of their uncertainty, but these tend to fail for inputs that are far outside the training data distribution. Recently, generative modelling approaches have been proposed as an alternative; these can quantify the likelihood of a data sample explicitly, filtering out any out-of-distribution (OOD) samples before further processing is performed. In this work, we focus on image segmentation and evaluate several approaches to network uncertainty in the far-OOD and near-OOD cases for the task of segmenting haemorrhages in head CTs. We find all of these approaches are unsuitable for safe segmentation as they provide confidently wrong predictions when operating OOD. We propose performing full 3D OOD detection using a VQ-GAN to provide a compressed latent representation of the image and a transformer to estimate the data likelihood. Our approach successfully identifies images in both the far- and near-OOD cases. We find a strong relationship between image likelihood and the quality of a model's segmentation, making this approach viable for filtering images unsuitable for segmentation. To our knowledge, this is the first time transformers have been applied to perform OOD detection on 3D image data. Code is available at github.com/marksgraham/transformer-ood.
Submitted 17 May, 2023; v1 submitted 21 May, 2022;
originally announced May 2022.
-
Exploring Unsupervised Learning Methods for Automated Protocol Analysis
Authors:
Arijit Dasgupta,
Yi-Xue Yan,
Clarence Ong,
Jenn-Yue Teo,
Chia-Wei Lim
Abstract:
The ability to analyse and differentiate network protocol traffic is crucial for network resource management to provide differentiated services by Telcos. Automated Protocol Analysis (APA) is crucial to significantly improve efficiency and reduce reliance on human experts. There are numerous automated state-of-the-art unsupervised methods for clustering unknown protocols in APA. However, many such methods have not been sufficiently explored using diverse test datasets, and so have not demonstrated their ability to generalise. This study proposed a comprehensive framework to evaluate various combinations of feature extraction and clustering methods in APA. It also proposed a novel approach to automate selection of dataset-dependent model parameters for feature extraction, resulting in improved performance. Promising results of a novel field-based tokenisation approach also led to our proposal of a novel automated hybrid approach for feature extraction and clustering of unknown protocols in APA. Our proposed hybrid approach performed the best in 7 out of 9 of the diverse test datasets, thus displaying the robustness to generalise across diverse unknown protocols. It also outperformed the unsupervised clustering technique in the state-of-the-art open-source APA tool, NETZOB, in all test datasets.
Submitted 17 November, 2021;
originally announced November 2021.
-
MedGPT: Medical Concept Prediction from Clinical Narratives
Authors:
Zeljko Kraljevic,
Anthony Shek,
Daniel Bean,
Rebecca Bendayan,
James Teo,
Richard Dobson
Abstract:
The data available in Electronic Health Records (EHRs) provides the opportunity to transform care, and the best way to provide better care for one patient is through learning from the data available on all other patients. Temporal modelling of a patient's medical history, which takes into account the sequence of past events, can be used to predict future events such as a diagnosis of a new disorder or complication of a previous or existing disorder. While most prediction approaches use mostly the structured data in EHRs or a subset of single-domain predictions and outcomes, we present MedGPT, a novel transformer-based pipeline that uses Named Entity Recognition and Linking tools (i.e. MedCAT) to structure and organize the free text portion of EHRs and anticipate a range of future medical events (initially disorders). Since a large portion of EHR data is in text form, such an approach benefits from a granular and detailed view of a patient while introducing modest additional noise. MedGPT effectively deals with the noise and the added granularity, and achieves a precision of 0.344, 0.552 and 0.640 (vs LSTM 0.329, 0.538 and 0.633) when predicting the top 1, 3 and 5 candidate future disorders on real world hospital data from King's College Hospital, London, UK (~600k patients). We also show that our model captures medical knowledge by testing it on an experimental medical multiple choice question answering task, and by examining the attentional focus of the model using gradient-based saliency methods.
Submitted 7 July, 2021;
originally announced July 2021.
-
Estimating Redundancy in Clinical Text
Authors:
Thomas Searle,
Zina Ibrahim,
James Teo,
Richard JB Dobson
Abstract:
The current mode of use of Electronic Health Record (EHR) elicits text redundancy. Clinicians often populate new documents by duplicating existing notes, then updating accordingly. Data duplication can lead to a propagation of errors, inconsistencies and misreporting of care. Therefore, quantifying information redundancy can play an essential role in evaluating innovations that operate on clinical narratives.
This work is a quantitative examination of information redundancy in EHR notes. We present and evaluate two strategies to measure redundancy: an information-theoretic approach and a lexicosyntactic and semantic model. We evaluate the measures by training large Transformer-based language models using clinical text from a large openly available US-based ICU dataset and a large multi-site UK-based Trust. By comparing the information-theoretic content of the trained models with open-domain language models, the language models trained using clinical text have been shown to be ~1.5x to ~3x less efficient than open-domain corpora. Manual evaluation shows a high correlation with lexicosyntactic and semantic redundancy, with averages of ~43% to ~65%.
Submitted 26 October, 2021; v1 submitted 25 May, 2021;
originally announced May 2021.
-
A Knowledge Distillation Ensemble Framework for Predicting Short and Long-term Hospitalisation Outcomes from Electronic Health Records Data
Authors:
Zina M Ibrahim,
Daniel Bean,
Thomas Searle,
Honghan Wu,
Anthony Shek,
Zeljko Kraljevic,
James Galloway,
Sam Norton,
James T Teo,
Richard JB Dobson
Abstract:
The ability to perform accurate prognosis of patients is crucial for proactive clinical decision making, informed resource management and personalised care. Existing outcome prediction models suffer from a low recall of infrequent positive outcomes. We present a highly-scalable and robust machine learning framework to automatically predict adversity represented by mortality and ICU admission from time-series vital signs and laboratory results obtained within the first 24 hours of hospital admission. The stacked platform comprises two components: a) an unsupervised LSTM Autoencoder that learns an optimal representation of the time-series, using it to differentiate the less frequent patterns which conclude with an adverse event from the majority patterns that do not, and b) a gradient boosting model, which relies on the constructed representation to refine prediction, incorporating static features of demographics, admission details and clinical summaries. The model is used to assess a patient's risk of adversity over time and provides visual justifications of its prediction based on the patient's static features and dynamic signals. Results of three case studies for predicting mortality and ICU admission show that the model outperforms all existing outcome prediction models, achieving PR-AUC of 0.891 (95% CI: 0.878-0.969) in predicting mortality in ICU and general ward settings and 0.908 (95% CI: 0.870-0.935) in predicting ICU admission.
Submitted 11 June, 2021; v1 submitted 18 November, 2020;
originally announced November 2020.
-
Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit
Authors:
Zeljko Kraljevic,
Thomas Searle,
Anthony Shek,
Lukasz Roguski,
Kawsar Noor,
Daniel Bean,
Aurelie Mascio,
Leilei Zhu,
Amos A Folarin,
Angus Roberts,
Rebecca Bendayan,
Mark P Richardson,
Robert Stewart,
Anoop D Shah,
Wai Keong Wong,
Zina Ibrahim,
James T Teo,
Richard JB Dobson
Abstract:
Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of Information Extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT) that provides: a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; b) a feature-rich annotation interface for customising and training IE models; and c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1:0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets, and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
Submitted 25 March, 2021; v1 submitted 2 October, 2020;
originally announced October 2020.
-
Data-Driven Multi-Objective Controller Optimization for a Magnetically-Levitated Nanopositioning System
Authors:
Xiaocong Li,
Haiyue Zhu,
Jun Ma,
Tat Joo Teo,
Chek Sing Teo,
Masayoshi Tomizuka,
Tong Heng Lee
Abstract:
The performance achieved with traditional model-based control system design approaches typically relies heavily upon accurate modeling of the motion dynamics. However, modeling the true dynamics of present-day increasingly complex systems can be an extremely challenging task, and the practical approximations that are usually necessary often force the automation system to operate in a non-optimal condition. This problem can be greatly aggravated in the case of a multi-axis magnetically-levitated nanopositioning system where the fully floating behavior and multi-axis coupling make extremely accurate identification of the motion dynamics largely impossible. On the other hand, in many related industrial automation applications, e.g., the scanning process with the maglev system, repetitive motions are involved which could generate a large amount of motion data under non-optimal conditions. These motion data essentially contain rich information; therefore, the possibility exists to develop an intelligent automation system to learn from these motion data and to drive the system to operate towards optimality in a data-driven manner. Along this line, this paper proposes a data-driven controller optimization approach that learns from the past non-optimal motion data to iteratively improve the motion control performance. Specifically, a novel data-driven multi-objective optimization approach is proposed that is able to automatically estimate the gradient and Hessian purely based on the measured motion data; the multi-objective cost function is suitably designed to take into account both smooth and accurate trajectory tracking. Experiments are then conducted on the maglev nanopositioning system to demonstrate the effectiveness of the proposed method, and the results show rather clearly the practical appeal of our methodology for related complex robotic systems with no accurate model available.
Submitted 6 July, 2020;
originally announced July 2020.
-
Massive Open and Online Courses and Open Education Resources in Singapore
Authors:
Victor Lim,
Lawrence Wee,
Jessica Teo,
Shannalyn Ng
Abstract:
This paper looks at the increasing popularity of massive open and online courses (MOOCs) and open educational resources (OERs) offered in Singapore. Despite being a relatively new phenomenon, the Singapore government has collaborated with different organizations to improve the quality and accessibility of MOOCs, and many institutions of higher learning (IHLs) are spearheading efforts to improve OERs to facilitate greater public access to educational resources. It will also explore the benefits and potential problems that MOOCs and OERs face. For example, both MOOCs and OERs are able to lower the costs of university-level education and increase public access to such courses. They also provide skills and job training for members of the public as well as encourage lifelong learning. However, both MOOCs and OERs may not be sustainable in the long run, as the financial gains of both may not be able to cover the costs of mounting them. Each system also has its own set of problems. For example, formal structures to guarantee the quality of MOOCs offered remain lacking. MOOCs also tend to have low completion rates and there have been issues regarding plagiarism with the use of MOOCs as learning platforms. OERs pose challenges to traditional copyright policies while lack of sustainable funding prevents them from being adopted more widely. Even though both systems may potentially transform the traditional education system, a deeper understanding of MOOCs and OERs as well as their implications on learning will be useful.
Submitted 15 August, 2017;
originally announced August 2017.
-
A Smart Cushion for Real-Time Heart Rate Monitoring
Authors:
Chacko John Deepu,
Zhihao Chen,
Ju Teng Teo,
Soon Huat Ng,
Xiefeng Yang,
Yong Lian
Abstract:
This paper presents a smart cushion for real-time heart rate monitoring. The cushion comprises an integrated micro-bending fiber sensor, which records the BCG (Ballistocardiogram) signal without direct skin-electrode contact, and an optical transceiver that performs signal amplification, digitization, and pre-filtering. To remove the artifacts and extract heart rate from the BCG signal, a computationally efficient heart rate detection algorithm is developed. The system doesn't require any pre-training and is highly responsive, with the outputs updated every 3 sec and an initial response within the first 10 sec. Tests conducted on human subjects show the detected heart rate closely matches the one from a commercial SpO2 device.
Submitted 29 September, 2014;
originally announced September 2014.