-
Improving Reliability and Explainability of Medical Question Answering through Atomic Fact Checking in Retrieval-Augmented LLMs
Authors:
Juraj Vladika,
Annika Domres,
Mai Nguyen,
Rebecca Moser,
Jana Nano,
Felix Busch,
Lisa C. Adams,
Keno K. Bressem,
Denise Bernhardt,
Stephanie E. Combs,
Kai J. Borm,
Florian Matthes,
Jan C. Peeken
Abstract:
Large language models (LLMs) exhibit extensive medical knowledge but are prone to hallucinations and inaccurate citations, which pose a challenge to their clinical adoption and regulatory compliance. Current methods, such as Retrieval Augmented Generation, partially address these issues by grounding answers in source documents, but hallucinations and low fact-level explainability persist. In this…
▽ More
Large language models (LLMs) exhibit extensive medical knowledge but are prone to hallucinations and inaccurate citations, which pose a challenge to their clinical adoption and regulatory compliance. Current methods, such as Retrieval Augmented Generation, partially address these issues by grounding answers in source documents, but hallucinations and low fact-level explainability persist. In this work, we introduce a novel atomic fact-checking framework designed to enhance the reliability and explainability of LLMs used in medical long-form question answering. This method decomposes LLM-generated responses into discrete, verifiable units called atomic facts, each of which is independently verified against an authoritative knowledge base of medical guidelines. This approach enables targeted correction of errors and direct tracing to source literature, thereby improving the factual accuracy and explainability of medical Q&A. Extensive evaluation using multi-reader assessments by medical experts and an automated open Q&A benchmark demonstrated significant improvements in factual accuracy and explainability. Our framework achieved up to a 40% overall answer improvement and a 50% hallucination detection rate. The ability to trace each atomic fact back to the most relevant chunks from the database provides a granular, transparent explanation of the generated responses, addressing a major gap in current medical AI applications. This work represents a crucial step towards more trustworthy and reliable clinical applications of LLMs, addressing key prerequisites for clinical application and fostering greater confidence in AI-assisted healthcare.
△ Less
Submitted 30 May, 2025;
originally announced May 2025.
-
Graph Neural Networks: A suitable Alternative to MLPs in Latent 3D Medical Image Classification?
Authors:
Johannes Kiechle,
Daniel M. Lang,
Stefan M. Fischer,
Lina Felsner,
Jan C. Peeken,
Julia A. Schnabel
Abstract:
Recent studies have underscored the capabilities of natural imaging foundation models to serve as powerful feature extractors, even in a zero-shot setting for medical imaging data. Most commonly, a shallow multi-layer perceptron (MLP) is appended to the feature extractor to facilitate end-to-end learning and downstream prediction tasks such as classification, thus representing the de facto standar…
▽ More
Recent studies have underscored the capabilities of natural imaging foundation models to serve as powerful feature extractors, even in a zero-shot setting for medical imaging data. Most commonly, a shallow multi-layer perceptron (MLP) is appended to the feature extractor to facilitate end-to-end learning and downstream prediction tasks such as classification, thus representing the de facto standard. However, as graph neural networks (GNNs) have become a practicable choice for various tasks in medical research in the recent past, we direct attention to the question of how effective GNNs are compared to MLP prediction heads for the task of 3D medical image classification, proposing them as a potential alternative. In our experiments, we devise a subject-level graph for each volumetric dataset instance. Therein latent representations of all slices in the volume, encoded through a DINOv2 pretrained vision transformer (ViT), constitute the nodes and their respective node features. We use public datasets to compare the classification heads numerically and evaluate various graph construction and graph convolution methods in our experiments. Our findings show enhancements of the GNN in classification performance and substantial improvements in runtime compared to an MLP prediction head. Additional robustness evaluations further validate the promising performance of the GNN, promoting them as a suitable alternative to traditional MLP classification heads. Our code is publicly available at: https://github.com/compai-lab/2024-miccai-grail-kiechle
△ Less
Submitted 24 July, 2024;
originally announced July 2024.
-
Progressive Growing of Patch Size: Resource-Efficient Curriculum Learning for Dense Prediction Tasks
Authors:
Stefan M. Fischer,
Lina Felsner,
Richard Osuala,
Johannes Kiechle,
Daniel M. Lang,
Jan C. Peeken,
Julia A. Schnabel
Abstract:
In this work, we introduce Progressive Growing of Patch Size, a resource-efficient implicit curriculum learning approach for dense prediction tasks. Our curriculum approach is defined by growing the patch size during model training, which gradually increases the task's difficulty. We integrated our curriculum into the nnU-Net framework and evaluated the methodology on all 10 tasks of the Medical S…
▽ More
In this work, we introduce Progressive Growing of Patch Size, a resource-efficient implicit curriculum learning approach for dense prediction tasks. Our curriculum approach is defined by growing the patch size during model training, which gradually increases the task's difficulty. We integrated our curriculum into the nnU-Net framework and evaluated the methodology on all 10 tasks of the Medical Segmentation Decathlon. With our approach, we are able to substantially reduce runtime, computational costs, and CO2 emissions of network training compared to classical constant patch size training. In our experiments, the curriculum approach resulted in improved convergence. We are able to outperform standard nnU-Net training, which is trained with constant patch size, in terms of Dice Score on 7 out of 10 MSD tasks while only spending roughly 50% of the original training runtime. To the best of our knowledge, our Progressive Growing of Patch Size is the first successful employment of a sample-length curriculum in the form of patch size in the field of computer vision. Our code is publicly available at https://github.com/compai-lab/2024-miccai-fischer.
△ Less
Submitted 11 July, 2024; v1 submitted 10 July, 2024;
originally announced July 2024.
-
Mask the Unknown: Assessing Different Strategies to Handle Weak Annotations in the MICCAI2023 Mediastinal Lymph Node Quantification Challenge
Authors:
Stefan M. Fischer,
Johannes Kiechle,
Daniel M. Lang,
Jan C. Peeken,
Julia A. Schnabel
Abstract:
Pathological lymph node delineation is crucial in cancer diagnosis, progression assessment, and treatment planning. The MICCAI 2023 Lymph Node Quantification Challenge published the first public dataset for pathological lymph node segmentation in the mediastinum. As lymph node annotations are expensive, the challenge was formed as a weakly supervised learning task, where only a subset of all lymph…
▽ More
Pathological lymph node delineation is crucial in cancer diagnosis, progression assessment, and treatment planning. The MICCAI 2023 Lymph Node Quantification Challenge published the first public dataset for pathological lymph node segmentation in the mediastinum. As lymph node annotations are expensive, the challenge was formed as a weakly supervised learning task, where only a subset of all lymph nodes in the training set have been annotated. For the challenge submission, multiple methods for training on these weakly supervised data were explored, including noisy label training, loss masking of unlabeled data, and an approach that integrated the TotalSegmentator toolbox as a form of pseudo labeling in order to reduce the number of unknown voxels. Furthermore, multiple public TCIA datasets were incorporated into the training to improve the performance of the deep learning model. Our submitted model achieved a Dice score of 0.628 and an average symmetric surface distance of 5.8~mm on the challenge test set. With our submitted model, we accomplished third rank in the MICCAI2023 LNQ challenge. A finding of our analysis was that the integration of all visible, including non-pathological, lymph nodes improved the overall segmentation performance on pathological lymph nodes of the test set. Furthermore, segmentation models trained only on clinically enlarged lymph nodes, as given in the challenge scenario, could not generalize to smaller pathological lymph nodes. The code and model for the challenge submission are available at \url{https://gitlab.lrz.de/compai/MediastinalLymphNodeSegmentation}.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
Analysis of clinical, dosimetric and radiomic features for predicting local failure after stereotactic radiotherapy of brain metastases in malignant melanoma
Authors:
Nanna E. Hartong,
Ilias Sachpazidis,
Oliver Blanck,
Lucas Etzel,
Jan C. Peeken,
Stephanie E. Combs,
Horst Urbach,
Maxim Zaitsev,
Dimos Baltas,
Ilinca Popp,
Anca-Ligia Grosu,
Tobias Fechter
Abstract:
Background: This study aimed to predict lesion-specific outcomes after stereotactic radiotherapy (SRT) in patients with brain metastases from malignant melanoma (MBM), using clinical, dosimetric pretherapeutic MRI data. Methods: In this multicenter retrospective study, 517 MBM from 130 patients treated with single-fraction or hypofractionated SRT across three centers were analyzed. From contrast-e…
▽ More
Background: This study aimed to predict lesion-specific outcomes after stereotactic radiotherapy (SRT) in patients with brain metastases from malignant melanoma (MBM), using clinical, dosimetric pretherapeutic MRI data. Methods: In this multicenter retrospective study, 517 MBM from 130 patients treated with single-fraction or hypofractionated SRT across three centers were analyzed. From contrast-enhanced T1-weighted MRI, 1576 radiomic features (RF) were extracted per lesion - 788 from the gross tumor volume (GTV), 788 from a 3 mm peritumoral margin. Clinical data, radiation dose and RF from one center were used for feature selection and model development via nested cross-validation; external validation was performed using the other two centers. Results: Local failure occurred in 72 of 517 lesions (13.9%). Predictive models based on clinical data (model 1), RF (model 2), or both (model 3) achieved c-indices of 0.60 +/- 0.15, 0.65 +/- 0.11, and 0.65 +/- 0.12. RF-based models outperformed the clinical model, while dosimetric data alone were not predictive. Most predictive RF came from the peritumoral margin (92%) vs. GTV (76%). On the first external dataset, all models performed similarly (c-index: 0.60-0.63), but showed poor generalization on the second (c-index < 0.50), likely due to differences in patient characteristics and imaging protocols. Conclusions: Information extracted from pretherapeutic MRI, particularly from the peritumoral area, can support accurate prediction of lesion-specific outcomes after SRT in MBM. When combined with clinical data, these imaging-derived markers offer valuable prognostic insights. However, generalizability remains challenging by heterogeneity in patient populations and MRI protocols.
△ Less
Submitted 22 May, 2025; v1 submitted 31 May, 2024;
originally announced May 2024.
-
All Sizes Matter: Improving Volumetric Brain Segmentation on Small Lesions
Authors:
Ayhan Can Erdur,
Daniel Scholz,
Josef A. Buchner,
Stephanie E. Combs,
Daniel Rueckert,
Jan C. Peeken
Abstract:
Brain metastases (BMs) are the most frequently occurring brain tumors. The treatment of patients having multiple BMs with stereo tactic radiosurgery necessitates accurate localization of the metastases. Neural networks can assist in this time-consuming and costly task that is typically performed by human experts. Particularly challenging is the detection of small lesions since they are often under…
▽ More
Brain metastases (BMs) are the most frequently occurring brain tumors. The treatment of patients having multiple BMs with stereo tactic radiosurgery necessitates accurate localization of the metastases. Neural networks can assist in this time-consuming and costly task that is typically performed by human experts. Particularly challenging is the detection of small lesions since they are often underrepresented in exist ing approaches. Yet, lesion detection is equally important for all sizes. In this work, we develop an ensemble of neural networks explicitly fo cused on detecting and segmenting small BMs. To accomplish this task, we trained several neural networks focusing on individual aspects of the BM segmentation problem: We use blob loss that specifically addresses the imbalance of lesion instances in terms of size and texture and is, therefore, not biased towards larger lesions. In addition, a model using a subtraction sequence between the T1 and T1 contrast-enhanced sequence focuses on low-contrast lesions. Furthermore, we train additional models only on small lesions. Our experiments demonstrate the utility of the ad ditional blob loss and the subtraction sequence. However, including the specialized small lesion models in the ensemble deteriorates segmentation results. We also find domain-knowledge-inspired postprocessing steps to drastically increase our performance in most experiments. Our approach enables us to submit a competitive challenge entry to the ASNR-MICCAI BraTS Brain Metastasis Challenge 2023.
△ Less
Submitted 4 October, 2023;
originally announced October 2023.
-
Interpretable 2D Vision Models for 3D Medical Images
Authors:
Alexander Ziller,
Ayhan Can Erdur,
Marwa Trigui,
Alp Güvenir,
Tamara T. Mueller,
Philip Müller,
Friederike Jungmann,
Johannes Brandt,
Jan Peeken,
Rickmer Braren,
Daniel Rueckert,
Georgios Kaissis
Abstract:
Training Artificial Intelligence (AI) models on 3D images presents unique challenges compared to the 2D case: Firstly, the demand for computational resources is significantly higher, and secondly, the availability of large datasets for pre-training is often limited, impeding training success. This study proposes a simple approach of adapting 2D networks with an intermediate feature representation…
▽ More
Training Artificial Intelligence (AI) models on 3D images presents unique challenges compared to the 2D case: Firstly, the demand for computational resources is significantly higher, and secondly, the availability of large datasets for pre-training is often limited, impeding training success. This study proposes a simple approach of adapting 2D networks with an intermediate feature representation for processing 3D images. Our method employs attention pooling to learn to assign each slice an importance weight and, by that, obtain a weighted average of all 2D slices. These weights directly quantify the contribution of each slice to the contribution and thus make the model prediction inspectable. We show on all 3D MedMNIST datasets as benchmark and two real-world datasets consisting of several hundred high-resolution CT or MRI scans that our approach performs on par with existing methods. Furthermore, we compare the in-built interpretability of our approach to HiResCam, a state-of-the-art retrospective interpretability approach.
△ Less
Submitted 5 December, 2023; v1 submitted 13 July, 2023;
originally announced July 2023.
-
blob loss: instance imbalance aware loss functions for semantic segmentation
Authors:
Florian Kofler,
Suprosanna Shit,
Ivan Ezhov,
Lucas Fidon,
Izabela Horvath,
Rami Al-Maskari,
Hongwei Li,
Harsharan Bhatia,
Timo Loehr,
Marie Piraud,
Ali Erturk,
Jan Kirschke,
Jan C. Peeken,
Tom Vercauteren,
Claus Zimmer,
Benedikt Wiestler,
Bjoern Menze
Abstract:
Deep convolutional neural networks (CNN) have proven to be remarkably effective in semantic segmentation tasks. Most popular loss functions were introduced targeting improved volumetric scores, such as the Dice coefficient (DSC). By design, DSC can tackle class imbalance, however, it does not recognize instance imbalance within a class. As a result, a large foreground instance can dominate minor i…
▽ More
Deep convolutional neural networks (CNN) have proven to be remarkably effective in semantic segmentation tasks. Most popular loss functions were introduced targeting improved volumetric scores, such as the Dice coefficient (DSC). By design, DSC can tackle class imbalance, however, it does not recognize instance imbalance within a class. As a result, a large foreground instance can dominate minor instances and still produce a satisfactory DSC. Nevertheless, detecting tiny instances is crucial for many applications, such as disease monitoring. For example, it is imperative to locate and surveil small-scale lesions in the follow-up of multiple sclerosis patients. We propose a novel family of loss functions, \emph{blob loss}, primarily aimed at maximizing instance-level detection metrics, such as F1 score and sensitivity. \emph{Blob loss} is designed for semantic segmentation problems where detecting multiple instances matters. We extensively evaluate a DSC-based \emph{blob loss} in five complex 3D semantic segmentation tasks featuring pronounced instance heterogeneity in terms of texture and morphology. Compared to soft Dice loss, we achieve 5% improvement for MS lesions, 3% improvement for liver tumor, and an average 2% improvement for microscopy segmentation tasks considering F1 score.
△ Less
Submitted 6 June, 2023; v1 submitted 17 May, 2022;
originally announced May 2022.
-
A unified 3D framework for Organs at Risk Localization and Segmentation for Radiation Therapy Planning
Authors:
Fernando Navarro,
Guido Sasahara,
Suprosanna Shit,
Ivan Ezhov,
Jan C. Peeken,
Stephanie E. Combs,
Bjoern H. Menze
Abstract:
Automatic localization and segmentation of organs-at-risk (OAR) in CT are essential pre-processing steps in medical image analysis tasks, such as radiation therapy planning. For instance, the segmentation of OAR surrounding tumors enables the maximization of radiation to the tumor area without compromising the healthy tissues. However, the current medical workflow requires manual delineation of OA…
▽ More
Automatic localization and segmentation of organs-at-risk (OAR) in CT are essential pre-processing steps in medical image analysis tasks, such as radiation therapy planning. For instance, the segmentation of OAR surrounding tumors enables the maximization of radiation to the tumor area without compromising the healthy tissues. However, the current medical workflow requires manual delineation of OAR, which is prone to errors and is annotator-dependent. In this work, we aim to introduce a unified 3D pipeline for OAR localization-segmentation rather than novel localization or segmentation architectures. To the best of our knowledge, our proposed framework fully enables the exploitation of 3D context information inherent in medical imaging. In the first step, a 3D multi-variate regression network predicts organs' centroids and bounding boxes. Secondly, 3D organ-specific segmentation networks are leveraged to generate a multi-organ segmentation map. Our method achieved an overall Dice score of $0.9260\pm 0.18 \%$ on the VISCERAL dataset containing CT scans with varying fields of view and multiple organs.
△ Less
Submitted 1 March, 2022;
originally announced March 2022.
-
Evaluating the Robustness of Self-Supervised Learning in Medical Imaging
Authors:
Fernando Navarro,
Christopher Watanabe,
Suprosanna Shit,
Anjany Sekuboyina,
Jan C. Peeken,
Stephanie E. Combs,
Bjoern H. Menze
Abstract:
Self-supervision has demonstrated to be an effective learning strategy when training target tasks on small annotated data-sets. While current research focuses on creating novel pretext tasks to learn meaningful and reusable representations for the target task, these efforts obtain marginal performance gains compared to fully-supervised learning. Meanwhile, little attention has been given to study…
▽ More
Self-supervision has demonstrated to be an effective learning strategy when training target tasks on small annotated data-sets. While current research focuses on creating novel pretext tasks to learn meaningful and reusable representations for the target task, these efforts obtain marginal performance gains compared to fully-supervised learning. Meanwhile, little attention has been given to study the robustness of networks trained in a self-supervised manner. In this work, we demonstrate that networks trained via self-supervised learning have superior robustness and generalizability compared to fully-supervised learning in the context of medical imaging. Our experiments on pneumonia detection in X-rays and multi-organ segmentation in CT yield consistent results exposing the hidden benefits of self-supervision for learning robust feature representations.
△ Less
Submitted 14 May, 2021;
originally announced May 2021.
-
Deep Learning Based HPV Status Prediction for Oropharyngeal Cancer Patients
Authors:
Daniel M. Lang,
Jan C. Peeken,
Stephanie E. Combs,
Jan J. Wilkens,
Stefan Bartzsch
Abstract:
We investigated the ability of deep learning models for imaging based HPV status detection. To overcome the problem of small medical datasets we used a transfer learning approach. A 3D convolutional network pre-trained on sports video clips was fine tuned such that full 3D information in the CT images could be exploited. The video pre-trained model was able to differentiate HPV-positive from HPV-n…
▽ More
We investigated the ability of deep learning models for imaging based HPV status detection. To overcome the problem of small medical datasets we used a transfer learning approach. A 3D convolutional network pre-trained on sports video clips was fine tuned such that full 3D information in the CT images could be exploited. The video pre-trained model was able to differentiate HPV-positive from HPV-negative cases with an area under the receiver operating characteristic curve (AUC) of 0.81 for an external test set. In comparison to a 3D convolutional neural network (CNN) trained from scratch and a 2D architecture pre-trained on ImageNet the video pre-trained model performed best.
△ Less
Submitted 17 November, 2020;
originally announced November 2020.
-
Deep Reinforcement Learning for Organ Localization in CT
Authors:
Fernando Navarro,
Anjany Sekuboyina,
Diana Waldmannstetter,
Jan C. Peeken,
Stephanie E. Combs,
Bjoern H. Menze
Abstract:
Robust localization of organs in computed tomography scans is a constant pre-processing requirement for organ-specific image retrieval, radiotherapy planning, and interventional image analysis. In contrast to current solutions based on exhaustive search or region proposals, which require large amounts of annotated data, we propose a deep reinforcement learning approach for organ localization in CT…
▽ More
Robust localization of organs in computed tomography scans is a constant pre-processing requirement for organ-specific image retrieval, radiotherapy planning, and interventional image analysis. In contrast to current solutions based on exhaustive search or region proposals, which require large amounts of annotated data, we propose a deep reinforcement learning approach for organ localization in CT. In this work, an artificial agent is actively self-taught to localize organs in CT by learning from its asserts and mistakes. Within the context of reinforcement learning, we propose a novel set of actions tailored for organ localization in CT. Our method can use as a plug-and-play module for localizing any organ of interest. We evaluate the proposed solution on the public VISCERAL dataset containing CT scans with varying fields of view and multiple organs. We achieved an overall intersection over union of 0.63, an absolute median wall distance of 2.25 mm, and a median distance between centroids of 3.65 mm.
△ Less
Submitted 11 May, 2020;
originally announced May 2020.
-
Neighborhood Watch: Representation Learning with Local-Margin Triplet Loss and Sampling Strategy for K-Nearest-Neighbor Image Classification
Authors:
Phawis Thammasorn,
Daniel Hippe,
Wanpracha Chaovalitwongse,
Matthew Spraker,
Landon Wootton,
Matthew Nyflot,
Stephanie Combs,
Jan Peeken,
Eric Ford
Abstract:
Deep representation learning using triplet network for classification suffers from a lack of theoretical foundation and difficulty in tuning both the network and classifiers for performance. To address the problem, local-margin triplet loss along with local positive and negative mining strategy is proposed with theory on how the strategy integrate nearest-neighbor hyper-parameter with triplet lear…
▽ More
Deep representation learning using triplet network for classification suffers from a lack of theoretical foundation and difficulty in tuning both the network and classifiers for performance. To address the problem, local-margin triplet loss along with local positive and negative mining strategy is proposed with theory on how the strategy integrate nearest-neighbor hyper-parameter with triplet learning to increase subsequent classification performance. Results in experiments with 2 public datasets, MNIST and Cifar-10, and 2 small medical image datasets demonstrate that proposed strategy outperforms end-to-end softmax and typical triplet loss in settings without data augmentation while maintaining utility of transferable feature for related tasks. The method serves as a good performance baseline where end-to-end methods encounter difficulties such as small sample data with limited allowable data augmentation.
△ Less
Submitted 28 October, 2019;
originally announced November 2019.
-
Shape-Aware Complementary-Task Learning for Multi-Organ Segmentation
Authors:
Fernando Navarro,
Suprosanna Shit,
Ivan Ezhov,
Johannes Paetzold,
Andrei Gafita,
Jan Peeken,
Stephanie Combs,
Bjoern Menze
Abstract:
Multi-organ segmentation in whole-body computed tomography (CT) is a constant pre-processing step which finds its application in organ-specific image retrieval, radiotherapy planning, and interventional image analysis. We address this problem from an organ-specific shape-prior learning perspective. We introduce the idea of complementary-task learning to enforce shape-prior leveraging the existing…
▽ More
Multi-organ segmentation in whole-body computed tomography (CT) is a constant pre-processing step which finds its application in organ-specific image retrieval, radiotherapy planning, and interventional image analysis. We address this problem from an organ-specific shape-prior learning perspective. We introduce the idea of complementary-task learning to enforce shape-prior leveraging the existing target labels. We propose two complementary-tasks namely i) distance map regression and ii) contour map detection to explicitly encode the geometric properties of each organ. We evaluate the proposed solution on the public VISCERAL dataset containing CT scans of multiple organs. We report a significant improvement of overall dice score from 0.8849 to 0.9018 due to the incorporation of complementary-task learning.
△ Less
Submitted 14 August, 2019;
originally announced August 2019.