Search | arXiv e-print repository

MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

Authors: Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M. Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, Hao Qiu, Shrey Jain, Leonardo Schettini, Mehr Kashyap, Jason Alan Fries, Akshay Swaminathan, Philip Chung, Fateme Nateghi, Asad Aali, Ashwin Nayak, Shivam Vedak, Sneha S. Jain, Birju Patel, Oluseyi Fayanju, Shreya Shah, et al. (56 additional authors not shown)

Abstract: While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an extensible evaluation framework for assessing LLM performance for medical tasks with three key contributions. First, a clinician-validated taxonomy spanning 5 categories, 22 subcategories, and 121 tasks developed with 29 clinicians. Second, a comprehensive benchmark suite comprising 35 benchmarks (17 existing, 18 newly formulated) providing complete coverage of all categories and subcategories in the taxonomy. Third, a systematic comparison of LLMs with improved evaluation methods (using an LLM-jury) and a cost-performance analysis. Evaluation of 9 frontier LLMs, using the 35 benchmarks, revealed significant performance variation. Advanced reasoning models (DeepSeek R1: 66% win-rate; o3-mini: 64% win-rate) demonstrated superior performance, though Claude 3.5 Sonnet achieved comparable results at 40% lower estimated computational cost. On a normalized accuracy scale (0-1), most models performed strongly in Clinical Note Generation (0.73-0.85) and Patient Communication & Education (0.78-0.83), moderately in Medical Research Assistance (0.65-0.75), and generally lower in Clinical Decision Support (0.56-0.72) and Administration & Workflow (0.53-0.63).
Our LLM-jury evaluation method achieved good agreement with clinician ratings (ICC = 0.47), surpassing both average clinician-clinician agreement (ICC = 0.43) and automated baselines including ROUGE-L (0.36) and BERTScore-F1 (0.44). Claude 3.5 Sonnet achieved comparable performance to top models at lower estimated cost. These findings highlight the importance of real-world, task-specific evaluation for medical use of LLMs and provide an open-source framework to enable this.

Submitted 2 June, 2025; v1 submitted 26 May, 2025; originally announced May 2025.

arXiv:2501.03155

powerROC: An Interactive Web Tool for Sample Size Calculation in Assessing Models' Discriminative Abilities

Authors: François Grolleau, Robert Tibshirani, Jonathan H. Chen

Abstract: Rigorous external validation is crucial for assessing the generalizability of prediction models, particularly by evaluating their discrimination (AUROC) on new data. This often involves comparing a new model's AUROC to that of an established reference model. However, many studies rely on arbitrary rules of thumb for sample size calculations, often resulting in underpowered analyses and unreliable conclusions. This paper reviews crucial concepts for accurate sample size determination in AUROC-based external validation studies, making the theory and practice more accessible to researchers and clinicians. We introduce powerROC, an open-source web tool designed to simplify these calculations, enabling both the evaluation of a single model and the comparison of two models. The tool offers guidance on selecting target precision levels and employs flexible approaches, leveraging either pilot data or user-defined probability distributions. We illustrate powerROC's utility through a case study on hospital mortality prediction using the MIMIC database.

Submitted 6 January, 2025; originally announced January 2025.

arXiv:2405.02779

Estimating Complier Average Causal Effects with Mixtures of Experts

Authors: François Grolleau, Céline Béji, Raphaël Porcher, François Petit

Abstract: Treatment non-compliance, where individuals deviate from their assigned experimental conditions, frequently complicates the estimation of causal effects. To address this, we introduce a novel learning framework based on a mixture of experts architecture to estimate the Complier Average Causal Effect (CACE). Our framework provides a flexible alternative to classical instrumental variable methods by relaxing their strict monotonicity and exclusion restriction assumptions. We develop a principled, two-step procedure where each step is optimized with a dedicated Expectation-Maximization (EM) algorithm. Crucially, we provide formal proofs that the model's components are identifiable, ensuring the learning procedure is well-posed. The resulting CACE estimators are proven to be consistent and asymptotically normal. Extensive simulations demonstrate that our method achieves a substantially lower root mean squared error than traditional instrumental variable approaches when their assumptions fail, an advantage that persists even when our own mixture of experts are misspecified. We illustrate the framework's practical utility on data from a large-scale randomized trial.

Submitted 24 June, 2025; v1 submitted 4 May, 2024; originally announced May 2024.

arXiv:2207.06275

A Comprehensive Framework for the Evaluation of Individual Treatment Rules From Observational Data

Authors: François Grolleau, Francois Petit, Raphaël Porcher

Abstract: Individualized treatment rules (ITRs) are deterministic decision rules that recommend treatments to individuals based on their characteristics. Though ubiquitous in medicine, ITRs are hardly ever evaluated in randomized controlled trials. To evaluate ITRs from observational data, we introduce a new probabilistic model and distinguish two situations: i) the situation of a newly developed ITR, where data are from a population where no patient implements the ITR, and ii) the situation of a partially implemented ITR, where data are from a population where the ITR is implemented in some unidentified patients. In the former situation, we propose a procedure to explore the impact of an ITR under various implementation schemes. In the latter situation, on top of the fundamental problem of causal inference, we need to handle an additional latent variable denoting implementation. To evaluate ITRs in this situation, we propose an estimation procedure that relies on an expectation-maximization algorithm. In Monte Carlo simulations our estimators appear unbiased with confidence intervals achieving nominal coverage. We illustrate our approach on the MIMIC-III database, focusing on ITRs for dialysis initiation in patients with acute kidney injury.

Submitted 21 August, 2023; v1 submitted 13 July, 2022; originally announced July 2022.

Showing 1–4 of 4 results for author: Grolleau, F