-
The Optimization Paradox in Clinical AI Multi-Agent Systems
Authors:
Suhana Bedi,
Iddah Mlauzi,
Daniel Shin,
Sanmi Koyejo,
Nigam H. Shah
Abstract:
Multi-agent artificial intelligence systems are increasingly deployed in clinical settings, yet the relationship between component-level optimization and system-wide performance remains poorly understood. We evaluated this relationship using 2,400 real patient cases from the MIMIC-CDM dataset across four abdominal pathologies (appendicitis, pancreatitis, cholecystitis, diverticulitis), decomposing clinical diagnosis into information gathering, interpretation, and differential diagnosis. We evaluated single-agent systems (one model performing all tasks) against multi-agent systems (specialized models for each task) using comprehensive metrics spanning diagnostic outcomes, process adherence, and cost efficiency. Our results reveal a paradox: while multi-agent systems generally outperformed single agents, the component-optimized, or Best of Breed, system with superior components and excellent process metrics (85.5% information accuracy) significantly underperformed in diagnostic accuracy (67.7% vs. 77.4% for a top multi-agent system). This finding underscores that successful integration of AI in healthcare requires not just component-level optimization but also attention to information flow and compatibility between agents. Our findings highlight the need for end-to-end system validation rather than relying on component metrics alone.
Submitted 11 June, 2025; v1 submitted 6 June, 2025;
originally announced June 2025.
-
MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks
Authors:
Suhana Bedi,
Hejie Cui,
Miguel Fuentes,
Alyssa Unell,
Michael Wornow,
Juan M. Banda,
Nikesh Kotecha,
Timothy Keyes,
Yifan Mai,
Mert Oez,
Hao Qiu,
Shrey Jain,
Leonardo Schettini,
Mehr Kashyap,
Jason Alan Fries,
Akshay Swaminathan,
Philip Chung,
Fateme Nateghi,
Asad Aali,
Ashwin Nayak,
Shivam Vedak,
Sneha S. Jain,
Birju Patel,
Oluseyi Fayanju,
Shreya Shah
, et al. (56 additional authors not shown)
Abstract:
While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance on medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63). Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open source framework to enable this.
Submitted 2 June, 2025; v1 submitted 26 May, 2025;
originally announced May 2025.
-
Distilling Large Language Models for Efficient Clinical Information Extraction
Authors:
Karthik S. Vedula,
Annika Gupta,
Akshay Swaminathan,
Ivan Lopez,
Suhana Bedi,
Nigam H. Shah
Abstract:
Large language models (LLMs) excel at clinical information extraction but their computational demands limit practical deployment. Knowledge distillation--the process of transferring knowledge from larger to smaller models--offers a potential solution. We evaluate the performance of distilled BERT models, which are approximately 1,000 times smaller than modern LLMs, for clinical named entity recognition (NER) tasks. We leveraged state-of-the-art LLMs (Gemini and OpenAI models) and medical ontologies (RxNorm and SNOMED) as teacher labelers for medication, disease, and symptom extraction. We applied our approach to over 3,300 clinical notes spanning five publicly available datasets, comparing distilled BERT models against both their teacher labelers and BERT models fine-tuned on human labels. External validation was conducted using clinical notes from the MedAlign dataset. For disease extraction, F1 scores were 0.82 (teacher model), 0.89 (BioBERT trained on human labels), and 0.84 (BioBERT-distilled). For medication, F1 scores were 0.84 (teacher model), 0.91 (BioBERT-human), and 0.87 (BioBERT-distilled). For symptoms, F1 scores were 0.73 (teacher model) and 0.68 (BioBERT-distilled). Distilled BERT models had faster inference (12x, 4x, 8x faster than GPT-4o, o1-mini, and Gemini Flash respectively) and lower costs (85x, 101x, 2x cheaper than GPT-4o, o1-mini, and Gemini Flash respectively). On the external validation dataset, the distilled BERT model achieved F1 scores of 0.883 (medication), 0.726 (disease), and 0.699 (symptom). Distilled BERT models were up to 101x cheaper and 12x faster than state-of-the-art LLMs while achieving similar performance on NER tasks. Distillation offers a computationally efficient and scalable alternative to large LLMs for clinical information extraction.
Submitted 20 December, 2024;
originally announced January 2025.
-
Context Clues: Evaluating Long Context Models for Clinical Prediction Tasks on EHRs
Authors:
Michael Wornow,
Suhana Bedi,
Miguel Angel Fuentes Hernandez,
Ethan Steinberg,
Jason Alan Fries,
Christopher Re,
Sanmi Koyejo,
Nigam H. Shah
Abstract:
Foundation Models (FMs) trained on Electronic Health Records (EHRs) have achieved state-of-the-art results on numerous clinical prediction tasks. However, most existing EHR FMs have context windows of <1k tokens. This prevents them from modeling full patient EHRs which can exceed 10k's of events. Recent advancements in subquadratic long-context architectures (e.g., Mamba) offer a promising solution. However, their application to EHR data has not been well-studied. We address this gap by presenting the first systematic evaluation of the effect of context length on modeling EHR data. We find that longer context models improve predictive performance -- our Mamba-based model surpasses the prior state-of-the-art on 9/14 tasks on the EHRSHOT prediction benchmark. For clinical applications, however, model performance alone is insufficient -- robustness to the unique properties of EHR is crucial. Thus, we also evaluate models across three previously underexplored properties of EHR data: (1) the prevalence of "copy-forwarded" diagnoses which creates artificial repetition of tokens within EHR sequences; (2) the irregular time intervals between EHR events which can lead to a wide range of timespans within a context window; and (3) the natural increase in disease complexity over time which makes later tokens in the EHR harder to predict than earlier ones. Stratifying our EHRSHOT results, we find that higher levels of each property correlate negatively with model performance, but that longer context models are more robust to more extreme levels of these properties. Our work highlights the potential for using long-context architectures to model EHR data, and offers a case study for identifying new challenges in modeling sequential data motivated by domains outside of natural language. We release our models and code at: https://github.com/som-shahlab/long_context_clues
Submitted 18 March, 2025; v1 submitted 9 December, 2024;
originally announced December 2024.
-
Time-to-Event Pretraining for 3D Medical Imaging
Authors:
Zepeng Huo,
Jason Alan Fries,
Alejandro Lozano,
Jeya Maria Jose Valanarasu,
Ethan Steinberg,
Louis Blankemeier,
Akshay S. Chaudhari,
Curtis Langlotz,
Nigam H. Shah
Abstract:
With the rise of medical foundation models and the growing availability of imaging data, scalable pretraining techniques offer a promising way to identify imaging biomarkers predictive of future disease risk. While current self-supervised methods for 3D medical imaging models capture local structural features like organ morphology, they fail to link pixel biomarkers with long-term health outcomes due to a missing context problem. Current approaches lack the temporal context necessary to identify biomarkers correlated with disease progression, as they rely on supervision derived only from images and concurrent text descriptions. To address this, we introduce time-to-event pretraining, a pretraining framework for 3D medical imaging models that leverages large-scale temporal supervision from paired, longitudinal electronic health records (EHRs). Using a dataset of 18,945 CT scans (4.2 million 2D images) and time-to-event distributions across thousands of EHR-derived tasks, our method improves outcome prediction, achieving an average AUROC increase of 23.7% and a 29.4% gain in Harrell's C-index across 8 benchmark tasks. Importantly, these gains are achieved without sacrificing diagnostic classification performance. This study lays the foundation for integrating longitudinal EHR and 3D imaging data to advance clinical risk prediction.
Submitted 19 March, 2025; v1 submitted 14 November, 2024;
originally announced November 2024.
-
Newtonized Orthogonal Matching Pursuit for High-Resolution Target Detection in Sparse OFDM ISAC Systems
Authors:
Syed Najaf Haider Shah,
Sebastian Semper,
Aamir Ullah Khan,
Christian Schneider,
Joerg Robert
Abstract:
Integrated Sensing and Communication (ISAC) is a technology paradigm that combines sensing capabilities with communication functionalities in a single device or system. In vehicle-to-everything (V2X) sidelink, ISAC can provide enhanced safety by allowing vehicles to not only communicate with one another but also sense the surrounding environment by using sidelink signals. In ISAC-capable V2X sidelink, the random resource allocation results in an unstructured and sparse distribution of time and frequency resources in the received orthogonal frequency division multiplexing (OFDM) grid, leading to degraded radar detection performance when processed using the conventional 2D-FFT method. To address this challenge, this paper proposes a high-resolution off-grid radar target detection algorithm irrespective of the OFDM grid structure. The proposed method utilizes the Newtonized orthogonal matching pursuit (NOMP) algorithm to effectively detect weak targets masked by the sidelobes of stronger ones and accurately estimates off-grid range and velocity parameters with minimal resources through Newton refinements. Simulation results demonstrate the superior performance of the proposed NOMP-based target detection algorithm compared to existing compressed sensing (CS) methods in terms of detection probability, resolution, and accuracy. Additionally, experimental validation is performed using a bi-static radar setup in a semi-anechoic chamber. The measurement results validate the simulation findings, showing that the proposed algorithm significantly enhances target detection and parameter estimation accuracy in realistic scenarios.
Submitted 5 November, 2024;
originally announced November 2024.
-
meds_reader: A fast and efficient EHR processing library
Authors:
Ethan Steinberg,
Michael Wornow,
Suhana Bedi,
Jason Alan Fries,
Matthew B. A. McDermott,
Nigam H. Shah
Abstract:
The growing demand for machine learning in healthcare requires processing increasingly large electronic health record (EHR) datasets, but existing pipelines are not computationally efficient or scalable. In this paper, we introduce meds_reader, an optimized Python package for efficient EHR data processing that is designed to take advantage of many intrinsic properties of EHR data for improved speed. We then demonstrate the benefits of meds_reader by reimplementing key components of two major EHR processing pipelines, achieving 10-100x improvements in memory, speed, and disk usage. The code for meds_reader can be found at https://github.com/som-shahlab/meds_reader.
Submitted 14 November, 2024; v1 submitted 12 September, 2024;
originally announced September 2024.
-
Answering real-world clinical questions using large language model based systems
Authors:
Yen Sia Low,
Michael L. Jackson,
Rebecca J. Hyde,
Robert E. Brown,
Neil M. Sanghavi,
Julian D. Baldwin,
C. William Pike,
Jananee Muralidharan,
Gavin Hui,
Natasha Alexander,
Hadeel Hassan,
Rahul V. Nene,
Morgan Pike,
Courtney J. Pokrzywa,
Shivam Vedak,
Adam Paul Yan,
Dong-han Yao,
Amy R. Zipursky,
Christina Dinh,
Philip Ballentine,
Dan C. Derieg,
Vladimir Polony,
Rehan N. Chawdry,
Jordan Davies,
Brigham B. Hyde
, et al. (2 additional authors not shown)
Abstract:
Evidence to guide healthcare decisions is often limited by a lack of relevant and trustworthy literature as well as difficulty in contextualizing existing research for a specific patient. Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD). We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability. As it stands, general-purpose LLMs (ChatGPT-4, Claude 3 Opus, Gemini Pro 1.5) rarely produced answers that were deemed relevant and evidence-based (2% - 10%). In contrast, retrieval augmented generation (RAG)-based and agentic LLM systems produced relevant and evidence-based answers for 24% (OpenEvidence) to 58% (ChatRWD) of questions. Only the agentic ChatRWD was able to answer novel questions compared to other LLMs (65% vs. 0-9%). These results suggest that while general-purpose LLMs should not be used as-is, a purpose-built system for evidence summarization based on RAG and one for generating novel evidence working synergistically would improve availability of pertinent evidence for patient care.
Submitted 29 June, 2024;
originally announced July 2024.
-
WONDERBREAD: A Benchmark for Evaluating Multimodal Foundation Models on Business Process Management Tasks
Authors:
Michael Wornow,
Avanika Narayan,
Ben Viggiano,
Ishan S. Khare,
Tathagat Verma,
Tibor Thompson,
Miguel Angel Fuentes Hernandez,
Sudharsan Sundar,
Chloe Trujillo,
Krrish Chawla,
Rongfei Lu,
Justin Shen,
Divya Nagaraj,
Joshua Martinez,
Vardhan Agrawal,
Althea Hudson,
Nigam H. Shah,
Christopher Re
Abstract:
Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating models on business process management (BPM) tasks. BPM is the practice of documenting, measuring, improving, and automating enterprise workflows. However, research has focused almost exclusively on one task - full end-to-end automation using agents based on multimodal foundation models (FMs) like GPT-4. This focus on automation ignores the reality of how most BPM tools are applied today - simply documenting the relevant workflow takes 60% of the time of the typical process optimization project. To address this gap we present WONDERBREAD, the first benchmark for evaluating multimodal FMs on BPM tasks beyond automation. Our contributions are: (1) a dataset containing 2928 documented workflow demonstrations; (2) 6 novel BPM tasks sourced from real-world applications ranging from workflow documentation to knowledge transfer to process improvement; and (3) an automated evaluation harness. Our benchmark shows that while state-of-the-art FMs can automatically generate documentation (e.g. recalling 88% of the steps taken in a video demonstration of a workflow), they struggle to re-apply that knowledge towards finer-grained validation of workflow completion (F1 < 0.3). We hope WONDERBREAD encourages the development of more "human-centered" AI tooling for enterprise applications and furthers the exploration of multimodal FMs for the broader universe of BPM tasks. We publish our dataset and experiments here: https://github.com/HazyResearch/wonderbread
Submitted 10 October, 2024; v1 submitted 19 June, 2024;
originally announced June 2024.
-
Merlin: A Vision Language Foundation Model for 3D Computed Tomography
Authors:
Louis Blankemeier,
Joseph Paul Cohen,
Ashwin Kumar,
Dave Van Veen,
Syed Jamal Safdar Gardezi,
Magdalini Paschali,
Zhihong Chen,
Jean-Benoit Delbrouck,
Eduardo Reis,
Cesar Truyts,
Christian Bluethgen,
Malte Engmann Kjeldskov Jensen,
Sophie Ostmeier,
Maya Varma,
Jeya Maria Jose Valanarasu,
Zhongnan Fang,
Zepeng Huo,
Zaid Nabulsi,
Diego Ardila,
Wei-Hung Weng,
Edson Amaro Junior,
Neera Ahuja,
Jason Fries,
Nigam H. Shah,
Andrew Johnston
, et al. (6 additional authors not shown)
Abstract:
Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current radiologist shortage, there is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies. Prior state-of-the-art approaches for automated medical image interpretation leverage vision language models (VLMs). However, current medical VLMs are generally limited to 2D images and short reports, and do not leverage electronic health record (EHR) data for supervision. We introduce Merlin - a 3D VLM that we train using paired CT scans (6+ million images from 15,331 CTs), EHR diagnosis codes (1.8+ million codes), and radiology reports (6+ million tokens). We evaluate Merlin on 6 task types and 752 individual tasks. The non-adapted (off-the-shelf) tasks include zero-shot findings classification (31 findings), phenotype classification (692 phenotypes), and zero-shot cross-modal retrieval (image to findings and image to impressions), while model-adapted tasks include 5-year disease prediction (6 diseases), radiology report generation, and 3D semantic segmentation (20 organs). We perform internal validation on a test set of 5,137 CTs, and external validation on 7,000 clinical CTs and on two public CT datasets (VerSe, TotalSegmentator). Beyond these clinically relevant evaluations, we assess the efficacy of various network architectures and training strategies to show that Merlin performs favorably compared with existing task-specific baselines. We derive data scaling laws to empirically assess training data needs for requisite downstream task performance. Furthermore, unlike conventional VLMs that require hundreds of GPUs for training, we perform all training on a single GPU.
Submitted 10 June, 2024;
originally announced June 2024.
-
Automating the Enterprise with Foundation Models
Authors:
Michael Wornow,
Avanika Narayan,
Krista Opsahl-Ong,
Quinn McIntyre,
Nigam H. Shah,
Christopher Re
Abstract:
Automating enterprise workflows could unlock $4 trillion/year in productivity gains. Despite being of interest to the data management community for decades, the ultimate vision of end-to-end workflow automation has remained elusive. Current solutions rely on process mining and robotic process automation (RPA), in which a bot is hard-coded to follow a set of predefined rules for completing a workflow. Through case studies of a hospital and large B2B enterprise, we find that the adoption of RPA has been inhibited by high set-up costs (12-18 months), unreliable execution (60% initial accuracy), and burdensome maintenance (requiring multiple FTEs). Multimodal foundation models (FMs) such as GPT-4 offer a promising new approach for end-to-end workflow automation given their generalized reasoning and planning abilities. To study these capabilities we propose ECLAIR, a system to automate enterprise workflows with minimal human supervision. We conduct initial experiments showing that multimodal FMs can address the limitations of traditional RPA with (1) near-human-level understanding of workflows (93% accuracy on a workflow understanding task) and (2) instant set-up with minimal technical barrier (based solely on a natural language description of a workflow, ECLAIR achieves end-to-end completion rates of 40%). We identify human-AI collaboration, validation, and self-improvement as open challenges, and suggest ways they can be solved with data management techniques. Code is available at: https://github.com/HazyResearch/eclair-agents
Submitted 3 May, 2024;
originally announced May 2024.
-
Standing on FURM ground -- A framework for evaluating Fair, Useful, and Reliable AI Models in healthcare systems
Authors:
Alison Callahan,
Duncan McElfresh,
Juan M. Banda,
Gabrielle Bunney,
Danton Char,
Jonathan Chen,
Conor K. Corbin,
Debadutta Dash,
Norman L. Downing,
Sneha S. Jain,
Nikesh Kotecha,
Jonathan Masterson,
Michelle M. Mello,
Keith Morse,
Srikar Nallan,
Abby Pandya,
Anurang Revri,
Aditya Sharma,
Christopher Sharp,
Rahul Thapa,
Michael Wornow,
Alaa Youssef,
Michael A. Pfeffer,
Nigam H. Shah
Abstract:
The impact of using artificial intelligence (AI) to guide patient care or operational processes is an interplay of the AI model's output, the decision-making protocol based on that output, and the capacity of the stakeholders involved to take the necessary subsequent action. Estimating the effects of this interplay before deployment, and studying it in real time afterwards, are essential to bridge the chasm between AI model development and achievable benefit. To accomplish this, the Data Science team at Stanford Health Care has developed a Testing and Evaluation (T&E) mechanism to identify fair, useful and reliable AI models (FURM) by conducting an ethical review to identify potential value mismatches, simulations to estimate usefulness, financial projections to assess sustainability, as well as analyses to determine IT feasibility, design a deployment strategy, and recommend a prospective monitoring and evaluation plan. We report on FURM assessments done to evaluate six AI guided solutions for potential adoption, spanning clinical and operational settings, each with the potential to impact from several dozen to tens of thousands of patients each year. We describe the assessment process, summarize the six assessments, and share our framework to enable others to conduct similar assessments. Of the six solutions we assessed, two have moved into a planning and implementation phase. Our novel contributions - usefulness estimates by simulation, financial projections to quantify sustainability, and a process to do ethical assessments - as well as their underlying methods and open source tools, are available for other healthcare systems to conduct actionable evaluations of candidate AI solutions.
Submitted 14 March, 2024; v1 submitted 26 February, 2024;
originally announced March 2024.
-
Zero-Shot Clinical Trial Patient Matching with LLMs
Authors:
Michael Wornow,
Alejandro Lozano,
Dev Dash,
Jenelle Jindal,
Kenneth W. Mahaffey,
Nigam H. Shah
Abstract:
Matching patients to clinical trials is a key unsolved challenge in bringing new drugs to market. Today, identifying patients who meet a trial's eligibility criteria is highly manual, taking up to 1 hour per patient. Automated screening is challenging, however, as it requires understanding unstructured clinical text. Large language models (LLMs) offer a promising solution. In this work, we explore their application to trial matching. First, we design an LLM-based system which, given a patient's medical history as unstructured clinical text, evaluates whether that patient meets a set of inclusion criteria (also specified as free text). Our zero-shot system achieves state-of-the-art scores on the n2c2 2018 cohort selection benchmark. Second, we improve the data and cost efficiency of our method by identifying a prompting strategy which matches patients an order of magnitude faster and more cheaply than the status quo, and develop a two-stage retrieval pipeline that reduces the number of tokens processed by up to a third while retaining high performance. Third, we evaluate the interpretability of our system by having clinicians evaluate the natural language justifications generated by the LLM for each eligibility decision, and show that it can output coherent explanations for 97% of its correct decisions and 75% of its incorrect ones. Our results establish the feasibility of using LLMs to accelerate clinical trial operations.
Submitted 10 April, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
-
INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis
Authors:
Shih-Cheng Huang,
Zepeng Huo,
Ethan Steinberg,
Chia-Chun Chiang,
Matthew P. Lungren,
Curtis P. Langlotz,
Serena Yeung,
Nigam H. Shah,
Jason A. Fries
Abstract:
Synthesizing information from multiple data sources plays a crucial role in the practice of modern medicine. Current applications of artificial intelligence in medicine often focus on single-modality data due to a lack of publicly available, multimodal medical datasets. To address this limitation, we introduce INSPECT, which contains de-identified longitudinal records from a large cohort of patients at risk for pulmonary embolism (PE), along with ground truth labels for multiple outcomes. INSPECT contains data from 19,402 patients, including CT images, radiology report impression sections, and structured electronic health record (EHR) data (i.e. demographics, diagnoses, procedures, vitals, and medications). Using INSPECT, we develop and release a benchmark for evaluating several baseline modeling approaches on a variety of important PE related tasks. We evaluate image-only, EHR-only, and multimodal fusion models. Trained models and the de-identified dataset are made available for non-commercial use under a data use agreement. To the best of our knowledge, INSPECT is the largest multimodal dataset integrating 3D medical imaging and EHR for reproducible methods evaluation and research.
Submitted 17 November, 2023;
originally announced November 2023.
-
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records
Authors:
Scott L. Fleming,
Alejandro Lozano,
William J. Haberkorn,
Jenelle A. Jindal,
Eduardo P. Reis,
Rahul Thapa,
Louis Blankemeier,
Julian Z. Genkins,
Ethan Steinberg,
Ashwin Nayak,
Birju S. Patel,
Chia-Chun Chiang,
Alison Callahan,
Zepeng Huo,
Sergios Gatidis,
Scott J. Adams,
Oluseyi Fayanju,
Shreya J. Shah,
Thomas Savage,
Ethan Goh,
Akshay S. Chaudhari,
Nima Aghaeepour,
Christopher Sharp,
Michael A. Pfeffer,
Percy Liang
, et al. (5 additional authors not shown)
Abstract:
The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and an 8.3% drop in accuracy moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. We make MedAlign available under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.
Submitted 24 December, 2023; v1 submitted 27 August, 2023;
originally announced August 2023.
-
EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models
Authors:
Michael Wornow,
Rahul Thapa,
Ethan Steinberg,
Jason A. Fries,
Nigam H. Shah
Abstract:
While the general machine learning (ML) community has benefited from public datasets, tasks, and models, the progress of ML in healthcare has been hampered by a lack of such shared assets. The success of foundation models creates new challenges for healthcare ML by requiring access to shared pretrained models to validate performance benefits. We help address these challenges through three contributions. First, we publish a new dataset, EHRSHOT, which contains deidentified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and not restricted to ICU/ED patients. Second, we publish the weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. We are one of the first to fully release such a model for coded EHR data; in contrast, most prior models released for clinical data (e.g. GatorTron, ClinicalBERT) only work with unstructured text and cannot process the rich, structured data within an EHR. We provide an end-to-end pipeline for the community to validate and build upon its performance. Third, we define 15 few-shot clinical prediction tasks, enabling evaluation of foundation models on benefits such as sample efficiency and task adaptation. Our model and dataset are available via a research data use agreement from our website: https://ehrshot.stanford.edu. Code to reproduce our results is available at our GitHub repo: https://github.com/som-shahlab/ehrshot-benchmark
Submitted 11 December, 2023; v1 submitted 5 July, 2023;
originally announced July 2023.
-
All models are local: time to replace external validation with recurrent local validation
Authors:
Alex Youssef,
Michael Pencina,
Anshul Thakur,
Tingting Zhu,
David Clifton,
Nigam H. Shah
Abstract:
External validation is often recommended to ensure the generalizability of ML models. However, it neither guarantees generalizability nor equates to a model's clinical usefulness (the ultimate goal of any clinical decision-support tool). External validation is misaligned with current healthcare ML needs. First, patient data changes across time, geography, and facilities. These changes create significant volatility in the performance of a single fixed model (especially for deep learning models, which dominate clinical ML). Second, newer ML techniques, current market forces, and updated regulatory frameworks are enabling frequent updating and monitoring of individual deployed model instances. We submit that external validation is insufficient to establish ML models' safety or utility. Proposals to fix the external validation paradigm do not go far enough. Continued reliance on it as the ultimate test is likely to lead us astray. We propose the MLOps-inspired paradigm of recurring local validation as an alternative that ensures the validity of models while protecting against performance-disruptive data variability. This paradigm relies on site-specific reliability tests before every deployment, followed by regular and recurrent checks throughout the life cycle of the deployed algorithm. Initial and recurrent reliability tests protect against performance-disruptive distribution shifts, and concept drifts that jeopardize patient safety.
Submitted 13 May, 2023; v1 submitted 4 May, 2023;
originally announced May 2023.
-
Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery
Authors:
Debadutta Dash,
Rahul Thapa,
Juan M. Banda,
Akshay Swaminathan,
Morgan Cheatham,
Mehr Kashyap,
Nikesh Kotecha,
Jonathan H. Chen,
Saurabh Gombar,
Lance Downing,
Rachel Pedreira,
Ethan Goh,
Angel Arnaout,
Garret Kenn Morris,
Honor Magon,
Matthew P Lungren,
Eric Horvitz,
Nigam H. Shah
Abstract:
Despite growing interest in using large language models (LLMs) in healthcare, current explorations do not assess the real-world utility and safety of LLMs in clinical settings. Our objective was to determine whether two LLMs can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner. Sixty-six questions from an informatics consult service were submitted to GPT-3.5 and GPT-4 via simple prompts. Twelve physicians assessed the LLM responses' possibility of patient harm and concordance with existing reports from an informatics consultation service. Physician assessments were summarized based on majority vote. For no questions did a majority of physicians deem either LLM response as harmful. For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 were discordant, and 9 were unable to be assessed. There were 29 responses with no majority on "Agree", "Disagree", and "Unable to assess". For GPT-4, responses to 13 questions were concordant, 15 were discordant, and 3 were unable to be assessed. There were 35 responses with no majority. Responses from both LLMs were largely devoid of overt harm, but fewer than 20% of the responses agreed with an answer from an informatics consultation service, responses contained hallucinated references, and physicians were divided on what constitutes harm. These results suggest that while general purpose LLMs are able to provide safe and credible responses, they often do not meet the specific information need of a given question. A definitive evaluation of the usefulness of LLMs in healthcare settings will likely require additional research on prompt engineering, calibration, and custom-tailoring of general purpose models.
Submitted 30 April, 2023; v1 submitted 26 April, 2023;
originally announced April 2023.
-
The Shaky Foundations of Clinical Foundation Models: A Survey of Large Language Models and Foundation Models for EMRs
Authors:
Michael Wornow,
Yizhe Xu,
Rahul Thapa,
Birju Patel,
Ethan Steinberg,
Scott Fleming,
Michael A. Pfeffer,
Jason Fries,
Nigam H. Shah
Abstract:
The successes of foundation models such as ChatGPT and AlphaFold have spurred significant interest in building similar models for electronic medical records (EMRs) to improve patient care and hospital operations. However, recent hype has obscured critical gaps in our understanding of these models' capabilities. We review over 80 foundation models trained on non-imaging EMR data (i.e. clinical text and/or structured data) and create a taxonomy delineating their architectures, training data, and potential use cases. We find that most models are trained on small, narrowly-scoped clinical datasets (e.g. MIMIC-III) or broad, public biomedical corpora (e.g. PubMed) and are evaluated on tasks that do not provide meaningful insights on their usefulness to health systems. In light of these findings, we propose an improved evaluation framework for measuring the benefits of clinical foundation models that is more closely grounded to metrics that matter in healthcare.
Submitted 24 March, 2023; v1 submitted 22 March, 2023;
originally announced March 2023.
-
DEPLOYR: A technical framework for deploying custom real-time machine learning models into the electronic medical record
Authors:
Conor K. Corbin,
Rob Maclay,
Aakash Acharya,
Sreedevi Mony,
Soumya Punnathanam,
Rahul Thapa,
Nikesh Kotecha,
Nigam H. Shah,
Jonathan H. Chen
Abstract:
Machine learning (ML) applications in healthcare are extensively researched, but successful translations to the bedside are scant. Healthcare institutions are establishing frameworks to govern and promote the implementation of accurate, actionable and reliable models that integrate with clinical workflow. Such governance frameworks require an accompanying technical framework to deploy models in a resource-efficient manner. Here we present DEPLOYR, a technical framework for enabling real-time deployment and monitoring of researcher-created clinical ML models into a widely used electronic medical record (EMR) system. We discuss core functionality and design decisions, including mechanisms to trigger inference based on actions within EMR software, modules that collect real-time data to make inferences, mechanisms that close the loop by displaying inferences back to end-users within their workflow, monitoring modules that track performance of deployed models over time, silent deployment capabilities, and mechanisms to prospectively evaluate a deployed model's impact. We demonstrate the use of DEPLOYR by silently deploying and prospectively evaluating twelve ML models triggered by clinician button-clicks in Stanford Health Care's production instance of Epic. Our study highlights the need for and feasibility of such silent deployment, because prospectively measured performance varies from retrospective estimates. By describing DEPLOYR, we aim to inform ML deployment best practices and help bridge the model implementation gap.
Submitted 10 March, 2023;
originally announced March 2023.
-
Instability in clinical risk stratification models using deep learning
Authors:
Daniel Lopez-Martinez,
Alex Yakubovich,
Martin Seneviratne,
Adam D. Lelkes,
Akshit Tyagi,
Jonas Kemp,
Ethan Steinberg,
N. Lance Downing,
Ron C. Li,
Keith E. Morse,
Nigam H. Shah,
Ming-Jun Chen
Abstract:
While it has been well known in the ML community that deep learning models suffer from instability, the consequences for healthcare deployments are under characterised. We study the stability of different model architectures trained on electronic health records, using a set of outpatient prediction tasks as a case study. We show that repeated training runs of the same deep learning model on the same training data can result in significantly different outcomes at a patient level even though global performance metrics remain stable. We propose two stability metrics for measuring the effect of randomness of model training, as well as mitigation strategies for improving model stability.
Submitted 19 November, 2022;
originally announced November 2022.
-
Clinical Utility Gains from Incorporating Comorbidity and Geographic Location Information into Risk Estimation Equations for Atherosclerotic Cardiovascular Disease
Authors:
Yizhe Xu,
Agata Foryciarz,
Ethan Steinberg,
Nigam H. Shah
Abstract:
Objective: There are several efforts to re-learn the 2013 ACC/AHA pooled cohort equations (PCE) for patients with specific comorbidities and geographic locations. With over 363 customized risk models in the literature, we aim to evaluate such revised models to determine if the performance improvements translate to gains in clinical utility.
Methods: We re-train a baseline PCE using the ACC/AHA PCE variables and revise it to incorporate subject-level geographic location and comorbidity information. We apply fixed effects, random effects, and extreme gradient boosting models to handle the correlation and heterogeneity induced by locations. Models are trained using 2,464,522 claims records from Optum Clinformatics Data Mart and validated in the hold-out set (N=1,056,224). We evaluate models' performance overall and across subgroups defined by the presence or absence of chronic kidney disease (CKD) or rheumatoid arthritis (RA) and geographic locations. We evaluate models' expected net benefit using decision curve analysis and models' statistical properties using several discrimination and calibration metrics.
Results: The baseline PCE is miscalibrated overall, in patients with CKD or RA, and locations with small populations. Our revised models improved both the overall (GND P-value=0.41) and subgroup calibration but only enhanced net benefit in the underrepresented subgroups. The gains are larger in the subgroups with comorbidities and heterogeneous across geographic locations.
Conclusions: Revising the PCE with comorbidity and location information significantly enhanced models' calibration; however, such improvements do not necessarily translate to clinical gains. Thus, we recommend that future work quantify the consequences of using risk calculators to guide clinical decisions.
Submitted 17 September, 2022; v1 submitted 14 September, 2022;
originally announced September 2022.
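The net benefit used in decision curve analysis, which the abstract above relies on, follows a standard formula: at a risk threshold p_t, net benefit = TP/N - (FP/N) * p_t / (1 - p_t). The sketch below is a minimal illustration of that general formula only, not the authors' implementation; the function names and the toy data are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating every patient whose predicted risk >= threshold.

    Standard decision-curve formula: TP/N - (FP/N) * pt / (1 - pt).
    Function name and signature are illustrative assumptions.
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    # Count true and false positives among patients flagged for treatment.
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def treat_all_net_benefit(y_true, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Toy data (hypothetical): two events, two non-events.
outcomes = [1, 0, 1, 0]
risks = [0.9, 0.2, 0.4, 0.6]
model_nb = net_benefit(outcomes, risks, threshold=0.5)
reference_nb = treat_all_net_benefit(outcomes, threshold=0.5)
```

A decision curve plots these quantities across a range of thresholds; a model adds clinical value at a given threshold only where its net benefit exceeds both the treat-all reference and zero (treat none).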
-
Net benefit, calibration, threshold selection, and training objectives for algorithmic fairness in healthcare
Authors:
Stephen R. Pfohl,
Yizhe Xu,
Agata Foryciarz,
Nikolaos Ignatiadis,
Julian Genkins,
Nigam H. Shah
Abstract:
A growing body of work uses the paradigm of algorithmic fairness to frame the development of techniques to anticipate and proactively mitigate the introduction or exacerbation of health inequities that may follow from the use of model-guided decision-making. We evaluate the interplay between measures of model performance, fairness, and the expected utility of decision-making to offer practical recommendations for the operationalization of algorithmic fairness principles for the development and evaluation of predictive models in healthcare. We conduct an empirical case-study via development of models to estimate the ten-year risk of atherosclerotic cardiovascular disease to inform statin initiation in accordance with clinical practice guidelines. We demonstrate that approaches that incorporate fairness considerations into the model training objective typically do not improve model performance or confer greater net benefit for any of the studied patient populations compared to the use of standard learning paradigms followed by threshold selection concordant with patient preferences, evidence of intervention effectiveness, and model calibration. These results hold when the measured outcomes are not subject to differential measurement error across patient populations and threshold selection is unconstrained, regardless of whether differences in model performance metrics, such as in true and false positive error rates, are present. In closing, we argue for focusing model development efforts on developing calibrated models that predict outcomes well for all patient populations while emphasizing that such efforts are complementary to transparent reporting, participatory design, and reasoning about the impact of model-informed interventions in context.
Submitted 3 February, 2022;
originally announced February 2022.
-
A comparison of approaches to improve worst-case predictive model performance over patient subpopulations
Authors:
Stephen R. Pfohl,
Haoran Zhang,
Yizhe Xu,
Agata Foryciarz,
Marzyeh Ghassemi,
Nigam H. Shah
Abstract:
Predictive models for clinical outcomes that are accurate on average in a patient population may underperform drastically for some subpopulations, potentially introducing or reinforcing inequities in care access and quality. Model training approaches that aim to maximize worst-case model performance across subpopulations, such as distributionally robust optimization (DRO), attempt to address this problem without introducing additional harms. We conduct a large-scale empirical study of DRO and several variations of standard learning procedures to identify approaches for model development and selection that consistently improve disaggregated and worst-case performance over subpopulations compared to standard approaches for learning predictive models from electronic health records data. In the course of our evaluation, we introduce an extension to DRO approaches that allows for specification of the metric used to assess worst-case performance. We conduct the analysis for models that predict in-hospital mortality, prolonged length of stay, and 30-day readmission for inpatient admissions, and predict in-hospital mortality using intensive care data. We find that, with relatively few exceptions, no approach performs better, for each patient subpopulation examined, than standard learning procedures using the entire training dataset. These results imply that when it is of interest to improve model performance for patient subpopulations beyond what can be achieved with standard practices, it may be necessary to do so via data collection techniques that increase the effective sample size or reduce the level of noise in the prediction problem.
Submitted 1 February, 2022; v1 submitted 27 August, 2021;
originally announced August 2021.
-
Ontology-driven weak supervision for clinical entity classification in electronic health records
Authors:
Jason A. Fries,
Ethan Steinberg,
Saelig Khattar,
Scott L. Fleming,
Jose Posada,
Alison Callahan,
Nigam H. Shah
Abstract:
In the electronic health record, using clinical notes to identify entities such as disorders and their temporality (e.g. the order of an event relative to a time index) can inform many important analyses. However, creating training data for clinical entity tasks is time consuming and sharing labeled data is challenging due to privacy concerns. The information needs of the COVID-19 pandemic highlight the need for agile methods of training machine learning models for clinical notes. We present Trove, a framework for weakly supervised entity classification using medical ontologies and expert-generated rules. Our approach, unlike hand-labeled notes, is easy to share and modify, while offering performance comparable to learning from manually labeled training data. In this work, we validate our framework on six benchmark tasks and demonstrate Trove's ability to analyze the records of patients visiting the emergency department at Stanford Health Care for COVID-19 presenting symptoms and risk factors.
Submitted 6 April, 2021; v1 submitted 5 August, 2020;
originally announced August 2020.
-
An Empirical Characterization of Fair Machine Learning For Clinical Risk Prediction
Authors:
Stephen R. Pfohl,
Agata Foryciarz,
Nigam H. Shah
Abstract:
The use of machine learning to guide clinical decision making has the potential to worsen existing health disparities. Several recent works frame the problem as that of algorithmic fairness, a framework that has attracted considerable attention and criticism. However, the appropriateness of this framework is unclear due to both ethical as well as technical considerations, the latter of which include trade-offs between measures of fairness and model performance that are not well-understood for predictive models of clinical outcomes. To inform the ongoing debate, we conduct an empirical study to characterize the impact of penalizing group fairness violations on an array of measures of model performance and group fairness. We repeat the analyses across multiple observational healthcare databases, clinical outcomes, and sensitive attributes. We find that procedures that penalize differences between the distributions of predictions across groups induce nearly-universal degradation of multiple performance metrics within groups. On examining the secondary impact of these procedures, we observe heterogeneity of the effect of these procedures on measures of fairness in calibration and ranking across experimental conditions. Beyond the reported trade-offs, we emphasize that analyses of algorithmic fairness in healthcare lack the contextual grounding and causal awareness necessary to reason about the mechanisms that lead to health disparities, as well as about the potential of algorithmic fairness methods to counteract those mechanisms. In light of these limitations, we encourage researchers building predictive models for clinical use to step outside the algorithmic fairness frame and engage critically with the broader sociotechnical context surrounding the use of machine learning in healthcare.
Submitted 15 June, 2021; v1 submitted 20 July, 2020;
originally announced July 2020.
-
Using public clinical trial reports to evaluate observational study methods
Authors:
Ethan Steinberg,
Nikolaos Ignatiadis,
Steve Yadlowsky,
Yizhe Xu,
Nigam H. Shah
Abstract:
Observational studies are valuable for estimating the effects of various medical interventions, but are notoriously difficult to evaluate because the methods used in observational studies require many untestable assumptions. This lack of verifiability makes it difficult both to compare different observational study methods and to trust the results of any particular observational study. In this work, we propose TrialVerify, a new approach for evaluating observational study methods based on ground truth sourced from clinical trial reports. We process trial reports into a denoised collection of known causal relationships that can then be used to estimate the precision and recall of various observational study methods. We then use TrialVerify to evaluate multiple observational study methods in terms of their ability to identify the known causal relationships from a large national insurance claims dataset. We found that inverse propensity score weighting is an effective approach for accurately reproducing known causal relationships and outperforms other observational study methods. TrialVerify is made freely available for others to evaluate observational study methods.
Submitted 13 September, 2022; v1 submitted 24 June, 2020;
originally announced June 2020.
-
Language Models Are An Effective Patient Representation Learning Technique For Electronic Health Record Data
Authors:
Ethan Steinberg,
Ken Jung,
Jason A. Fries,
Conor K. Corbin,
Stephen R. Pfohl,
Nigam H. Shah
Abstract:
Widespread adoption of electronic health records (EHRs) has fueled the use of machine learning to build prediction models for various clinical outcomes. This process is often constrained by having a relatively small number of patient records for training the model. We demonstrate that using patient representation schemes inspired by techniques in natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, where only a subset of the population is relevant. Such patient representation schemes enable a 3.5% mean improvement in AUROC on five prediction tasks compared to standard baselines, with the average improvement rising to 19% when only a small number of patient records are available for training the clinical prediction model.
Submitted 12 May, 2020; v1 submitted 6 January, 2020;
originally announced January 2020.
-
Counterfactual Reasoning for Fair Clinical Risk Prediction
Authors:
Stephen Pfohl,
Tony Duan,
Daisy Yi Ding,
Nigam H. Shah
Abstract:
The use of machine learning systems to support decision making in healthcare raises questions as to what extent these systems may introduce or exacerbate disparities in care for historically underrepresented and mistreated groups, due to biases implicitly embedded in observational data in electronic health records. To address this problem in the context of clinical risk prediction models, we develop an augmented counterfactual fairness criterion to extend the group fairness criterion of equalized odds to an individual level. We do so by requiring that the same prediction be made for a patient, and a counterfactual patient resulting from changing a sensitive attribute, if the factual and counterfactual outcomes do not differ. We investigate the extent to which the augmented counterfactual fairness criterion may be applied to develop fair models for prolonged inpatient length of stay and mortality with observational electronic health records data. As the fairness criterion is ill-defined without knowledge of the data generating process, we use a variational autoencoder to perform counterfactual inference in the context of an assumed causal graph. While our technique provides a means to trade off maintenance of fairness with reduction in predictive performance in the context of a learned generative model, further work is needed to assess the generality of this approach.
Submitted 14 July, 2019;
originally announced July 2019.
-
Medical device surveillance with electronic health records
Authors:
Alison Callahan,
Jason A Fries,
Christopher Ré,
James I Huddleston III,
Nicholas J Giori,
Scott Delp,
Nigam H Shah
Abstract:
Post-market medical device surveillance is a challenge facing manufacturers, regulatory agencies, and health care providers. Electronic health records are valuable sources of real world evidence to assess device safety and track device-related patient outcomes over time. However, distilling this evidence remains challenging, as information is fractured across clinical notes and structured records. Modern machine learning methods for machine reading promise to unlock increasingly complex information from text, but face barriers due to their reliance on large and expensive hand-labeled training sets. To address these challenges, we developed and validated state-of-the-art deep learning methods that identify patient outcomes from clinical notes without requiring hand-labeled training data. Using hip replacements as a test case, our methods accurately extracted implant details and reports of complications and pain from electronic health records with up to 96.3% precision, 98.5% recall, and 97.4% F1, improved classification performance by 12.7-53.0% over rule-based methods, and detected over 6 times as many complication events compared to using structured data alone. Using these events to assess complication-free survivorship of different implant systems, we found significant variation between implants, including for risk of revision surgery, which could not be detected using coded data alone. Patients with revision surgeries had more hip pain mentions in the post-hip replacement, pre-revision period compared to patients with no evidence of revision surgery (mean hip pain mentions 4.97 vs. 3.23; t = 5.14; p < 0.001). Some implant models were associated with higher or lower rates of hip pain mentions. Our methods complement existing surveillance mechanisms by requiring orders of magnitude less hand-labeled training data, offering a scalable solution for national medical device surveillance.
Submitted 3 April, 2019;
originally announced April 2019.
-
A Semi-Supervised Machine Learning Approach to Detecting Recurrent Metastatic Breast Cancer Cases Using Linked Cancer Registry and Electronic Medical Record Data
Authors:
Albee Y. Ling,
Allison W. Kurian,
Jennifer L. Caswell-Jin,
George W. Sledge Jr.,
Nigam H. Shah,
Suzanne R. Tamang
Abstract:
Objectives: Most cancer data sources lack information on metastatic recurrence. Electronic medical records (EMRs) and population-based cancer registries contain complementary information on cancer treatment and outcomes, yet are rarely used synergistically. To enable detection of metastatic breast cancer (MBC), we applied a semi-supervised machine learning framework to linked EMR-California Cancer Registry (CCR) data. Materials and Methods: We studied 11,459 female patients treated at Stanford Health Care who received an incident breast cancer diagnosis from 2000-2014. The dataset consisted of structured data and unstructured free-text clinical notes from EMR, linked to CCR, a component of the Surveillance, Epidemiology and End Results (SEER) database. We extracted information on metastatic disease from patient notes to infer a class label and then trained a regularized logistic regression model for MBC classification. We evaluated model performance on a gold standard set of 146 patients. Results: There are 495 patients with de novo stage IV MBC, 1,374 patients initially diagnosed with Stage 0-III disease had recurrent MBC, and 9,590 had no evidence of metastasis. The median follow-up time is 96.3 months (mean 97.8, standard deviation 46.7). The best-performing model incorporated both EMR and CCR features. The area under the receiver-operating characteristic curve=0.925 [95% confidence interval: 0.880-0.969], sensitivity=0.861, specificity=0.878 and overall accuracy=0.870. Discussion and Conclusion: A framework for MBC case detection combining EMR and CCR data achieved good sensitivity, specificity and discrimination without requiring expert-labeled examples. This approach enables population-based research on how patients die from cancer and may identify novel predictors of cancer recurrence.
Submitted 16 January, 2019;
originally announced January 2019.
-
Predicting Inpatient Discharge Prioritization With Electronic Health Records
Authors:
Anand Avati,
Stephen Pfohl,
Chris Lin,
Thao Nguyen,
Meng Zhang,
Philip Hwang,
Jessica Wetstone,
Kenneth Jung,
Andrew Ng,
Nigam H. Shah
Abstract:
Identifying patients who will be discharged within 24 hours can improve hospital resource management and quality of care. We studied this problem using eight years of Electronic Health Records (EHR) data from Stanford Hospital. We fit models to predict 24-hour discharge across the entire inpatient population. The best performing models achieved an area under the receiver-operator characteristic curve (AUROC) of 0.85 and an AUPRC of 0.53 on a held-out test set. This model was also well calibrated. Finally, we analyzed the utility of this model in a decision theoretic framework to identify regions of ROC space in which using the model increases expected utility compared to the trivial always negative or always positive classifiers.
Submitted 2 December, 2018;
originally announced December 2018.
-
Creating Fair Models of Atherosclerotic Cardiovascular Disease Risk
Authors:
Stephen Pfohl,
Ben Marafino,
Adrien Coulet,
Fatima Rodriguez,
Latha Palaniappan,
Nigam H. Shah
Abstract:
Guidelines for the management of atherosclerotic cardiovascular disease (ASCVD) recommend the use of risk stratification models to identify patients most likely to benefit from cholesterol-lowering and other therapies. These models have differential performance across race and gender groups with inconsistent behavior across studies, potentially resulting in an inequitable distribution of beneficial therapy. In this work, we leverage adversarial learning and a large observational cohort extracted from electronic health records (EHRs) to develop a "fair" ASCVD risk prediction model with reduced variability in error rates across groups. We empirically demonstrate that our approach is capable of aligning the distribution of risk predictions conditioned on the outcome across several groups simultaneously for models built from high-dimensional EHR data. We also discuss the relevance of these results in the context of the empirical trade-off between fairness and model performance.
Submitted 14 June, 2019; v1 submitted 12 September, 2018;
originally announced September 2018.
-
The Effectiveness of Multitask Learning for Phenotyping with Electronic Health Records Data
Authors:
Daisy Yi Ding,
Chloé Simpson,
Stephen Pfohl,
Dave C. Kale,
Kenneth Jung,
Nigam H. Shah
Abstract:
Electronic phenotyping is the task of ascertaining whether an individual has a medical condition of interest by analyzing their medical record and is foundational in clinical informatics. Increasingly, electronic phenotyping is performed via supervised learning. We investigate the effectiveness of multitask learning for phenotyping using electronic health records (EHR) data. Multitask learning aims to improve model performance on a target task by jointly learning additional auxiliary tasks and has been used in disparate areas of machine learning. However, its utility when applied to EHR data has not been established, and prior work suggests that its benefits are inconsistent. We present experiments that elucidate when multitask learning with neural nets improves performance for phenotyping using EHR data relative to neural nets trained for a single phenotype and to well-tuned logistic regression baselines. We find that multitask neural nets consistently outperform single-task neural nets for rare phenotypes but underperform for relatively more common phenotypes. The effect size increases as more auxiliary tasks are added. Moreover, multitask learning reduces the sensitivity of neural nets to hyperparameter settings for rare phenotypes. Last, we quantify phenotype complexity and find that neural nets trained with or without multitask learning do not improve on simple baselines unless the phenotypes are sufficiently complex.
Submitted 5 January, 2019; v1 submitted 9 August, 2018;
originally announced August 2018.
-
Countdown Regression: Sharp and Calibrated Survival Predictions
Authors:
Anand Avati,
Tony Duan,
Sharon Zhou,
Kenneth Jung,
Nigam H. Shah,
Andrew Ng
Abstract:
Probabilistic survival predictions from models trained with Maximum Likelihood Estimation (MLE) can have high, and sometimes unacceptably high, variance. The field of meteorology, where the paradigm of maximizing sharpness subject to calibration is popular, has addressed this problem by using scoring rules beyond MLE, such as the Continuous Ranked Probability Score (CRPS). In this paper we present the Survival-CRPS, a generalization of the CRPS to the survival prediction setting, with right-censored and interval-censored variants. We evaluate our ideas on the mortality prediction task using two different Electronic Health Record (EHR) data sets (STARR and MIMIC-III) covering millions of patients, with suitable deep neural network architectures: a Recurrent Neural Network (RNN) for STARR and a Fully Connected Network (FCN) for MIMIC-III. We compare results between the two scoring rules while keeping the network architecture and data fixed, and show that models trained with Survival-CRPS result in sharper predictive distributions compared to those trained by MLE, while still maintaining calibration.
Submitted 18 June, 2019; v1 submitted 21 June, 2018;
originally announced June 2018.
-
Monitoring physical function in patients with knee osteoarthritis using data from wearable activity monitors
Authors:
Vibhu Agarwal,
Matthew Smuck,
Nigam H Shah
Abstract:
Currently used clinical assessments for physical function do not objectively quantify daily activities in routine living. Wearable activity monitors enable objective measurement of routine daily activities, but do not map to clinically measured physical performance measures. We represent physical function as a daily activity profile derived from minute-level activity data obtained via a wearable activity monitor. We construct daily activity profiles representing average time spent in a set of activity classes over consecutive days using the Osteoarthritis Initiative (OAI) data. Using the daily activity profile as input, we trained statistical models that classify subjects into quartiles of objective measurements of physical function as measured via the 400m walk test, the 20m walk test and 5 times sit stand test. We evaluated model performance on held out data from the same calendar year as that used to train the models as well as on activity data two years into the future. The daily activity profile predicts physical performance as measured via clinical assessments. Using held out data, the AUC obtained in classifying performance values in the 1st quartile was 0.79, 0.78 and 0.72, for the 400m walk, the 20m walk and 5 times sit stand tests. For classifying performance values in the 4th quartile, the AUC obtained was 0.77, 0.66 and 0.73 respectively. Evaluated on data from two years into the future, for the 20m pace test and the 5 times sit stand tests, the highest AUC obtained was 0.77 and 0.68 for the 1st quartile and 0.75 and 0.70 for the 4th quartile respectively. We can construct activity profiles representing actual physical function as demonstrated by the relationship between the activity profiles and the clinically measured physical performance measures. Measurement of physical performance via the activity profile as described can enable remote functional monitoring of patients.
Submitted 25 January, 2018;
originally announced January 2018.
-
Scalable and accurate deep learning for electronic health records
Authors:
Alvin Rajkomar,
Eyal Oren,
Kai Chen,
Andrew M. Dai,
Nissan Hajaj,
Peter J. Liu,
Xiaobing Liu,
Mimi Sun,
Patrik Sundberg,
Hector Yee,
Kun Zhang,
Gavin E. Duggan,
Gerardo Flores,
Michaela Hardt,
Jamie Irvine,
Quoc Le,
Kurt Litsch,
Jake Marcus,
Alexander Mossin,
Justin Tansuwan,
De Wang,
James Wexler,
Jimbo Wilson,
Dana Ludwig,
Samuel L. Volchenboum, et al. (9 additional authors not shown)
Abstract:
Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient's record. We propose a representation of patients' entire, raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format. We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization. We validated our approach using de-identified EHR data from two U.S. academic medical centers with 216,221 adult patients hospitalized for at least 24 hours. In the sequential format we propose, this volume of EHR data unrolled into a total of 46,864,534,945 data points, including clinical notes. Deep learning models achieved high accuracy for tasks such as predicting in-hospital mortality (AUROC across sites 0.93-0.94), 30-day unplanned readmission (AUROC 0.75-0.76), prolonged length of stay (AUROC 0.85-0.86), and all of a patient's final discharge diagnoses (frequency-weighted AUROC 0.90). These models outperformed state-of-the-art traditional predictive models in all cases. We also present a case-study of a neural-network attribution system, which illustrates how clinicians can gain some transparency into the predictions. We believe that this approach can be used to create accurate and scalable predictions for a variety of clinical scenarios, complete with explanations that directly highlight evidence in the patient's chart.
Submitted 11 May, 2018; v1 submitted 24 January, 2018;
originally announced January 2018.
-
Improving Palliative Care with Deep Learning
Authors:
Anand Avati,
Kenneth Jung,
Stephanie Harman,
Lance Downing,
Andrew Ng,
Nigam H. Shah
Abstract:
Improving the quality of end-of-life care for hospitalized patients is a priority for healthcare organizations. Studies have shown that physicians tend to over-estimate prognoses, which in combination with treatment inertia results in a mismatch between patients' wishes and actual care at the end of life. We describe a method to address this problem using Deep Learning and Electronic Health Record (EHR) data, which is currently being piloted, with Institutional Review Board approval, at an academic medical center. The EHR data of admitted patients are automatically evaluated by an algorithm, which brings patients who are likely to benefit from palliative care services to the attention of the Palliative Care team. The algorithm is a Deep Neural Network trained on the EHR data from previous years, to predict all-cause 3-12 month mortality of patients as a proxy for patients that could benefit from palliative care. Our predictions enable the Palliative Care team to take a proactive approach in reaching out to such patients, rather than relying on referrals from treating physicians, or conducting time-consuming chart reviews of all patients. We also present a novel interpretation technique which we use to provide explanations of the model's predictions.
Submitted 16 November, 2017;
originally announced November 2017.
-
Some methods for heterogeneous treatment effect estimation in high-dimensions
Authors:
Scott Powers,
Junyang Qian,
Kenneth Jung,
Alejandro Schuler,
Nigam H. Shah,
Trevor Hastie,
Robert Tibshirani
Abstract:
When devising a course of treatment for a patient, doctors often have little quantitative evidence on which to base their decisions, beyond their medical education and published clinical trials. Stanford Health Care alone has millions of electronic medical records (EMRs) that are only just recently being leveraged to inform better treatment recommendations. These data present a unique challenge because they are high-dimensional and observational. Our goal is to make personalized treatment recommendations based on the outcomes for past patients similar to a new patient. We propose and analyze three methods for estimating heterogeneous treatment effects using observational data. Our methods perform well in simulations using a wide variety of treatment effect functions, and we present results of applying the two most promising methods to data from The SPRINT Data Analysis Challenge, from a large randomized trial of a treatment for high blood pressure.
Submitted 1 July, 2017;
originally announced July 2017.
-
Provenance-Centered Dataset of Drug-Drug Interactions
Authors:
Juan M. Banda,
Tobias Kuhn,
Nigam H. Shah,
Michel Dumontier
Abstract:
Over the years several studies have demonstrated the ability to identify potential drug-drug interactions via data mining from the literature (MEDLINE), electronic health records, public databases (Drugbank), etc. While each one of these approaches is properly statistically validated, they do not take into consideration the overlap between them as one of their decision making variables. In this paper we present LInked Drug-Drug Interactions (LIDDI), a public nanopublication-based RDF dataset with trusty URIs that encompasses some of the most cited prediction methods and sources to provide researchers a resource for leveraging the work of others into their prediction methods. As one of the main issues to overcome the usage of external resources is their mappings between drug names and identifiers used, we also provide the set of mappings we curated to be able to compare the multiple sources we aggregate in our dataset.
Submitted 20 July, 2015;
originally announced July 2015.