-
Global explainability of a deep abstaining classifier
Authors:
Sayera Dhaubhadel,
Jamaludin Mohd-Yusof,
Benjamin H. McMahon,
Trilce Estrada,
Kumkum Ganguly,
Adam Spannaus,
John P. Gounley,
Xiao-Cheng Wu,
Eric B. Durbin,
Heidi A. Hanson,
Tanmoy Bhattacharya
Abstract:
We present a global explainability method to characterize sources of errors in the histology prediction task of our real-world multitask convolutional neural network (MTCNN)-based deep abstaining classifier (DAC), for automated annotation of cancer pathology reports from NCI-SEER registries. Our classifier was trained and evaluated on 1.04 million hand-annotated samples and makes simultaneous predictions of cancer site, subsite, histology, laterality, and behavior for each report. The DAC framework enables the model to abstain on ambiguous reports and/or confusing classes to achieve a target accuracy on the retained (non-abstained) samples, but at the cost of decreased coverage. Requiring 97% accuracy on the histology task caused our model to retain only 22% of all samples, mostly the less ambiguous and common classes. Local explainability with the GradInp technique provided a computationally efficient way of obtaining contextual reasoning for thousands of individual predictions. Our method, involving dimensionality reduction of approximately 13000 aggregated local explanations, enabled global identification of sources of errors as hierarchical complexity among classes, label noise, insufficient information, and conflicting evidence. This suggests several strategies such as exclusion criteria, focused annotation, and reduced penalties for errors involving hierarchically related classes to iteratively improve our DAC in this complex real-world implementation.
Submitted 1 April, 2025;
originally announced April 2025.
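The local explanation step mentioned in this abstract can be illustrated with a minimal sketch of gradient-times-input attribution, assuming that "GradInp" refers to that family of methods. The toy logistic model, the weights, and the function names below are hypothetical stand-ins for illustration, not the authors' MTCNN or code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_times_input(w, b, x):
    """Gradient x Input attribution for a toy logistic model p = sigmoid(w.x + b).

    Assumption: a hypothetical stand-in for a deep network; only the
    attribution rule (gradient multiplied elementwise by the input) is the point.
    """
    p = sigmoid(np.dot(w, x) + b)
    grad = p * (1.0 - p) * w   # analytic dp/dx for the logistic model
    return grad * x            # one attribution score per input feature

# Hypothetical weights and input features
w = np.array([2.0, -1.0, 0.0])
b = 0.0
x = np.array([1.0, 1.0, 5.0])

attr = grad_times_input(w, b, x)
print(attr)
```

A feature with zero weight receives zero attribution regardless of its magnitude, while the sign of each score indicates whether the feature pushes the prediction up or down.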
-
Domain Shift Analysis in Chest Radiographs Classification in a Veterans Healthcare Administration Population
Authors:
Mayanka Chandrashekar,
Ian Goethert,
Md Inzamam Ul Haque,
Benjamin McMahon,
Sayera Dhaubhadel,
Kathryn Knight,
Joseph Erdos,
Donna Reagan,
Caroline Taylor,
Peter Kuzmak,
John Michael Gaziano,
Eileen McAllister,
Lauren Costa,
Yuk-Lam Ho,
Kelly Cho,
Suzanne Tamang,
Samah Fodeh-Jarad,
Olga S. Ovchinnikova,
Amy C. Justice,
Jacob Hinkle,
Ioana Danciu
Abstract:
Objectives: This study aims to assess the impact of domain shift on chest X-ray classification accuracy and to analyze the influence of ground truth label quality and demographic factors such as age group, sex, and study year. Materials and Methods: We used a DenseNet121 model pretrained on the MIMIC-CXR dataset for deep learning-based multilabel classification, using ground truth labels extracted from radiology reports with the CheXpert and CheXbert labelers. We compared the performance on the 14 chest X-ray labels between the MIMIC-CXR and Veterans Healthcare Administration chest X-ray (VA-CXR) datasets. The VA-CXR dataset comprises over 259k chest X-ray images spanning the years 2010 to 2022. Results: The validation of ground truth and the assessment of multi-label classification performance across various NLP extraction tools revealed that the VA-CXR dataset exhibited lower disagreement rates than the MIMIC-CXR dataset. Additionally, there were notable differences in AUC scores between models utilizing CheXpert and CheXbert. When evaluating multi-label classification performance across different datasets, minimal domain shift was observed in unseen datasets, except for the label "Enlarged Cardiomediastinum." Subgroup analyses by study year exhibited the most significant variations in multi-label classification model performance. These findings underscore the importance of considering domain shifts in chest X-ray classification tasks, particularly concerning study years. Conclusion: Our study reveals the significant impact of domain shift and demographic factors on chest X-ray classification, emphasizing the need for improved transfer learning and equitable model development. Addressing these challenges is crucial for advancing medical imaging and enhancing patient care.
Submitted 30 July, 2024;
originally announced July 2024.
-
Towards a machine-readable literature: finding relevant papers based on an uploaded powder diffraction pattern
Authors:
Berrak Özer,
Martin A. Karlsen,
Zachary Thatcher,
Ling Lan,
Brian McMahon,
Peter R. Strickland,
Simon P. Westrip,
Koh S. Sang,
David G. Billing,
Dorthe B. Ravnsbæk,
Simon J. L. Billinge
Abstract:
We investigate a prototype application for machine-readable literature. The program is called "pyDataRecognition" and serves as an example of a data-driven literature search, where the literature search query is an experimental data-set provided by the user. The user uploads a powder pattern together with the radiation wavelength. The program compares the user data to a database of existing powder patterns associated with published papers and produces a ranking of the entries ordered by similarity score. The program returns the digital object identifier (DOI) and full reference of the top-ranked papers, together with a stack plot of the user data alongside the top five database entries. The paper describes the approach and explores successes and challenges.
Submitted 1 April, 2022;
originally announced April 2022.
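The ranking step described in this abstract can be sketched as follows. The abstract does not state which similarity score pyDataRecognition uses, so this example assumes the Pearson correlation between patterns already interpolated onto a common grid; the function name, the synthetic peaks, and the placeholder identifiers are all hypothetical.

```python
import numpy as np

def rank_by_similarity(user_pattern, database, top_n=5):
    """Return (identifier, score) pairs sorted by descending similarity.

    Assumption: similarity is the Pearson correlation coefficient and all
    patterns share one grid; the real program may differ on both points.
    """
    scores = []
    for identifier, pattern in database.items():
        r = np.corrcoef(user_pattern, pattern)[0, 1]
        scores.append((identifier, float(r)))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]

# Synthetic patterns built from Gaussian peaks on a shared grid
grid = np.linspace(0.0, 10.0, 200)

def peak(center):
    return np.exp(-((grid - center) ** 2) / 0.05)

# Placeholder identifiers, not real DOIs
database = {
    "placeholder-doi-A": peak(3.0) + peak(6.0),
    "placeholder-doi-B": peak(2.0) + peak(8.0),
}

# User pattern: same peaks as entry A plus a constant background
user = peak(3.0) + peak(6.0) + 0.01
ranked = rank_by_similarity(user, database)
print(ranked)
```

Correlation ignores constant offsets and overall scale, which is convenient for patterns measured with different backgrounds or intensities, but it does not by itself account for different radiation wavelengths; those would need conversion to a common axis first.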
-
Why I'm not Answering: Understanding Determinants of Classification of an Abstaining Classifier for Cancer Pathology Reports
Authors:
Sayera Dhaubhadel,
Jamaludin Mohd-Yusof,
Kumkum Ganguly,
Gopinath Chennupati,
Sunil Thulasidasan,
Nicolas W. Hengartner,
Brent J. Mumphrey,
Eric B. Durbin,
Jennifer A. Doherty,
Mireille Lemieux,
Noah Schaefferkoetter,
Georgia Tourassi,
Linda Coyle,
Lynne Penberthy,
Benjamin H. McMahon,
Tanmoy Bhattacharya
Abstract:
Safe deployment of deep learning systems in critical real-world applications requires models to make very few mistakes, and only under predictable circumstances. In this work, we address this problem using an abstaining classifier that is tuned to have >95% accuracy, and then identify the determinants of abstention using LIME. Essentially, we are training our model to learn the attributes of pathology reports that are likely to lead to incorrect classifications, albeit at the cost of reduced sensitivity. We demonstrate an abstaining classifier in a multitask setting for classifying cancer pathology reports from the NCI SEER cancer registries on six tasks of interest. For these tasks, we reduce the classification error rate by factors of 2 to 5 by abstaining on 25 to 45% of the reports. For the specific task of classifying cancer site, we are able to identify metastasis, reports involving lymph nodes, and discussion of multiple cancer sites as responsible for many of the classification mistakes, and observe that the extent and types of mistakes vary systematically with cancer site (e.g., breast, lung, and prostate). When combining across three of the tasks, our model classifies 50% of the reports with an accuracy greater than 95% for three of the six tasks, and greater than 85% for all six tasks on the retained samples. Furthermore, we show that LIME provides a better determinant of classification than measures of word occurrence alone. By combining a deep abstaining classifier with feature identification using LIME, we are able to identify concepts responsible for both correctness and abstention when classifying cancer sites from pathology reports. The improvement of LIME over keyword searches is statistically significant, presumably because words are assessed in context and have been identified as a local determinant of classification.
Submitted 21 April, 2022; v1 submitted 10 September, 2020;
originally announced September 2020.
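The trade-off in this abstract between accuracy on retained reports and the fraction of reports abstained on can be illustrated with a small sketch. It uses a simple confidence threshold as a stand-in, which is an assumption made for illustration: the paper's deep abstaining classifier is tuned during training rather than by post-hoc thresholding, and the scores and labels below are made up.

```python
import numpy as np

def accuracy_and_coverage(confidence, correct, threshold):
    """Accuracy on retained samples and the fraction of samples retained.

    Samples with confidence below `threshold` are treated as abstentions.
    """
    retained = confidence >= threshold
    coverage = retained.mean()
    accuracy = correct[retained].mean() if retained.any() else float("nan")
    return float(accuracy), float(coverage)

# Made-up confidences and correctness flags for six predictions
confidence = np.array([0.99, 0.95, 0.90, 0.70, 0.60, 0.55])
correct = np.array([True, True, True, False, True, False])

acc_all, cov_all = accuracy_and_coverage(confidence, correct, 0.0)
acc_hi, cov_hi = accuracy_and_coverage(confidence, correct, 0.9)
print(acc_all, cov_all)
print(acc_hi, cov_hi)
```

Raising the threshold removes the low-confidence predictions, where the errors concentrate in this toy data, so accuracy on the retained set rises while coverage falls.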