-
Large Language Models for Medical OSCE Assessment: A Novel Approach to Transcript Analysis
Authors:
Ameer Hamza Shakur,
Michael J. Holcomb,
David Hein,
Shinyoung Kang,
Thomas O. Dalton,
Krystle K. Campbell,
Daniel J. Scott,
Andrew R. Jamieson
Abstract:
Grading Objective Structured Clinical Examinations (OSCEs) is a time-consuming and expensive process, traditionally requiring extensive manual effort from human experts. In this study, we explore the potential of Large Language Models (LLMs) to assess skills related to medical student communication. We analyzed 2,027 video-recorded OSCE examinations from the University of Texas Southwestern Medica…
▽ More
Grading Objective Structured Clinical Examinations (OSCEs) is a time-consuming and expensive process, traditionally requiring extensive manual effort from human experts. In this study, we explore the potential of Large Language Models (LLMs) to assess skills related to medical student communication. We analyzed 2,027 video-recorded OSCE examinations from the University of Texas Southwestern Medical Center (UTSW), spanning four years (2019-2022), and several different medical cases or "stations." Specifically, our focus was on evaluating students' ability to summarize patients' medical history: we targeted the rubric item 'did the student summarize the patients' medical history?' from the communication skills rubric. After transcribing speech audio captured by OSCE videos using Whisper-v3, we studied the performance of various LLM-based approaches for grading students on this summarization task based on their examination transcripts. Using various frontier-level open-source and proprietary LLMs, we evaluated different techniques such as zero-shot chain-of-thought prompting, retrieval augmented generation, and multi-model ensemble methods. Our results show that frontier LLM models like GPT-4 achieved remarkable alignment with human graders, demonstrating a Cohen's kappa agreement of 0.88 and indicating strong potential for LLM-based OSCE grading to augment the current grading process. Open-source models also showed promising results, suggesting potential for widespread, cost-effective deployment. Further, we present a failure analysis identifying conditions where LLM grading may be less reliable in this context and recommend best practices for deploying LLMs in medical education settings.
△ Less
Submitted 11 October, 2024;
originally announced October 2024.
-
Classification of Major Depressive Disorder Using Vertex-Wise Brain Sulcal Depth, Curvature, and Thickness with a Deep and a Shallow Learning Model
Authors:
Roberto Goya-Maldonado,
Tracy Erwin-Grabner,
Ling-Li Zeng,
Christopher R. K. Ching,
Andre Aleman,
Alyssa R. Amod,
Zeynep Basgoze,
Francesco Benedetti,
Bianca Besteher,
Katharina Brosch,
Robin Bülow,
Romain Colle,
Colm G. Connolly,
Emmanuelle Corruble,
Baptiste Couvy-Duchesne,
Kathryn Cullen,
Udo Dannlowski,
Christopher G. Davey,
Annemiek Dols,
Jan Ernsting,
Jennifer W. Evans,
Lukas Fisch,
Paola Fuentes-Claramonte,
Ali Saffet Gonul,
Ian H. Gotlib
, et al. (62 additional authors not shown)
Abstract:
Major depressive disorder (MDD) is a complex psychiatric disorder that affects the lives of hundreds of millions of individuals around the globe. Even today, researchers debate if morphological alterations in the brain are linked to MDD, likely due to the heterogeneity of this disorder. The application of deep learning tools to neuroimaging data, capable of capturing complex non-linear patterns, h…
▽ More
Major depressive disorder (MDD) is a complex psychiatric disorder that affects the lives of hundreds of millions of individuals around the globe. Even today, researchers debate if morphological alterations in the brain are linked to MDD, likely due to the heterogeneity of this disorder. The application of deep learning tools to neuroimaging data, capable of capturing complex non-linear patterns, has the potential to provide diagnostic and predictive biomarkers for MDD. However, previous attempts to demarcate MDD patients and healthy controls (HC) based on segmented cortical features via linear machine learning approaches have reported low accuracies. Here, we used globally representative data from the ENIGMA-MDD working group containing 7,012 participants from 30 sites (N=2,772 MDD and N=4,240 HC), which allows a comprehensive analysis with generalizable results. Based on the hypothesis that integration of vertex-wise cortical features can improve classification performance, we evaluated the classification of a DenseNet and a Support Vector Machine (SVM), with the expectation that the former would outperform the latter. We found that both classifiers exhibited close to chance performance (balanced accuracy DenseNet: 51%; SVM: 53%), when estimated on unseen sites. Slightly higher classification performance (balanced accuracy DenseNet: 58%; SVM: 55%) was found when the cross-validation folds contained subjects from all sites, indicating site effect. In conclusion, the integration of vertex-wise morphometric features and the use of the non-linear classifier did not lead to the differentiability between MDD and HC. Our results support the notion that MDD classification on this combination of such features and classifiers is unfeasible. Perhaps more sophisticated integration of multimodal information may lead to a higher performance in this diagnostic task.
△ Less
Submitted 24 January, 2025; v1 submitted 18 November, 2023;
originally announced November 2023.
-
Second Set of Spaces
Authors:
Evangelos Zirintsis,
Graham Kirby,
Alan Dearle,
Ben Allen,
Rob MacInnis,
Andrew McCarthy,
Ron Morrison,
Paddy Nixon,
Andrew Jamieson,
Chris Nicholson,
Steven Harris
Abstract:
This document describes the Gloss infrastructure supporting implementation of location-aware services. The document is in two parts. The first part describes software architecture for the smart space. As described in D8, a local architecture provides a framework for constructing Gloss applications, termed assemblies, that run on individual physical nodes, whereas a global architecture defines an o…
▽ More
This document describes the Gloss infrastructure supporting implementation of location-aware services. The document is in two parts. The first part describes software architecture for the smart space. As described in D8, a local architecture provides a framework for constructing Gloss applications, termed assemblies, that run on individual physical nodes, whereas a global architecture defines an overlay network for linking individual assemblies. The second part outlines the hardware installation for local sensing. This describes the first phase of the installation in Strathclyde University.
△ Less
Submitted 29 June, 2010;
originally announced June 2010.