Search | arXiv e-print repository

Impact of Label Noise from Large Language Models Generated Annotations on Evaluation of Diagnostic Model Performance

Authors: Mohammadreza Chavoshi, Hari Trivedi, Janice Newsome, Aawez Mansuri, Chiratidzo Rudado Sanyika, Rohan Satya Isaac, Frank Li, Theo Dapamede, Judy Gichoya

Abstract: Large language models (LLMs) are increasingly used to generate labels from radiology reports to enable large-scale AI evaluation. However, label noise from LLMs can introduce bias into performance estimates, especially under varying disease prevalence and model quality. This study quantifies how LLM labeling errors impact downstream diagnostic model evaluation. We developed a simulation framework to assess how LLM label errors affect observed model performance. A synthetic dataset of 10,000 cases was generated across different prevalence levels. LLM sensitivity and specificity were varied independently between 90% and 100%. We simulated diagnostic models with true sensitivity and specificity ranging from 90% to 100%. Observed performance was computed using LLM-generated labels as the reference. We derived analytical performance bounds and ran 5,000 Monte Carlo trials per condition to estimate empirical uncertainty. Observed performance was highly sensitive to LLM label quality, with bias strongly influenced by disease prevalence. In low-prevalence settings, small reductions in LLM specificity led to substantial underestimation of sensitivity. For example, at 10% prevalence, an LLM with 95% specificity yielded an observed sensitivity of ~53% despite a perfect model. In high-prevalence scenarios, reduced LLM sensitivity caused underestimation of model specificity. Monte Carlo simulations consistently revealed downward bias, with observed performance often falling below true values even when within theoretical bounds.
LLM-generated labels can introduce systematic, prevalence-dependent bias into model evaluation. Specificity is more critical in low-prevalence tasks, while sensitivity dominates in high-prevalence settings. These findings highlight the importance of prevalence-aware prompt design and error characterization when using LLMs for post-deployment model assessment in clinical AI.

Submitted 8 June, 2025; originally announced June 2025.
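
The mechanism described in the abstract above (a model scored against imperfect LLM labels) can be illustrated with a short calculation. This is a minimal sketch, not the authors' simulation framework: the function name `observed_metrics` and its parameters are hypothetical, and it assumes that model errors and LLM labeling errors are conditionally independent given the true disease status. Its output need not match the figures reported in the paper, which depend on the authors' own setup.

```python
def observed_metrics(prevalence, model_sens, model_spec, llm_sens, llm_spec):
    """Expected (sensitivity, specificity) of a model scored against LLM labels.

    Assumption (not taken from the abstract): model errors and LLM errors are
    conditionally independent given the true disease status.
    """
    p, q = prevalence, 1.0 - prevalence

    # P(model positive AND LLM label positive), summed over true status.
    both_pos = p * model_sens * llm_sens + q * (1 - model_spec) * (1 - llm_spec)
    # P(LLM label positive): the denominator of observed sensitivity.
    llm_pos = p * llm_sens + q * (1 - llm_spec)

    # P(model negative AND LLM label negative), summed over true status.
    both_neg = q * model_spec * llm_spec + p * (1 - model_sens) * (1 - llm_sens)
    # P(LLM label negative): the denominator of observed specificity.
    llm_neg = q * llm_spec + p * (1 - llm_sens)

    return both_pos / llm_pos, both_neg / llm_neg


if __name__ == "__main__":
    # Perfect model, perfect LLM sensitivity, 10% prevalence; vary LLM specificity.
    for llm_spec in (1.00, 0.99, 0.95, 0.90):
        sens, spec = observed_metrics(0.10, 1.0, 1.0, 1.0, llm_spec)
        print(f"LLM specificity {llm_spec:.2f}: observed sensitivity {sens:.3f}")
```

Under these assumptions, every false-positive LLM label at low prevalence enlarges the pool of reference positives that a perfect model correctly rejects, so observed sensitivity falls as LLM specificity falls. The mirror-image effect on observed specificity appears at high prevalence when LLM sensitivity is reduced.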

arXiv:2504.16047 [pdf]

Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis

Authors: Frank Li, Hari Trivedi, Bardia Khosravi, Theo Dapamede, Mohammadreza Chavoshi, Abdulhameed Dere, Rohan Satya Isaac, Aawez Mansuri, Janice Newsome, Saptarshi Purkayastha, Judy Gichoya

Abstract: Foundation models, trained on vast amounts of data using self-supervised techniques, have emerged as a promising frontier for advancing artificial intelligence (AI) applications in medicine. This study evaluates three different vision-language foundation models (RAD-DINO, CheXagent, and BiomedCLIP) on their ability to capture fine-grained imaging features for radiology tasks. The models were assessed across classification, segmentation, and regression tasks for pneumothorax and cardiomegaly on chest radiographs. Self-supervised RAD-DINO consistently excelled in segmentation tasks, while text-supervised CheXagent demonstrated superior classification performance. BiomedCLIP showed inconsistent performance across tasks. A custom segmentation model that integrates global and local features substantially improved performance for all foundation models, particularly for challenging pneumothorax segmentation. The findings highlight that pre-training methodology significantly influences model performance on specific downstream tasks. For fine-grained segmentation tasks, models trained without text supervision performed better, while text-supervised models offered advantages in classification and interpretability. These insights provide guidance for selecting foundation models based on specific clinical applications in radiology.

Submitted 22 April, 2025; originally announced April 2025.

arXiv:2503.14550 [pdf, other]

Novel AI-Based Quantification of Breast Arterial Calcification to Predict Cardiovascular Risk

Authors: Theodorus Dapamede, Aisha Urooj, Vedant Joshi, Gabrielle Gershon, Frank Li, Mohammadreza Chavoshi, Beatrice Brown-Mulry, Rohan Satya Isaac, Aawez Mansuri, Chad Robichaux, Chadi Ayoub, Reza Arsanjani, Laurence Sperling, Judy Gichoya, Marly van Assen, Charles W. ONeill, Imon Banerjee, Hari Trivedi

Abstract: Women are underdiagnosed and undertreated for cardiovascular disease. Automatic quantification of breast arterial calcification on screening mammography can identify women at risk for cardiovascular disease and enable earlier treatment and management of disease. In this retrospective study of 116,135 women from two healthcare systems, a transformer-based neural network quantified BAC severity (no BAC, mild, moderate, and severe) on screening mammograms. Outcomes included major adverse cardiovascular events (MACE) and all-cause mortality. BAC severity was independently associated with MACE after adjusting for cardiovascular risk factors, with increasing hazard ratios from mild (HR 1.18-1.22), moderate (HR 1.38-1.47), to severe BAC (HR 2.03-2.22) across datasets (all p<0.001). This association remained significant across all age groups, with even mild BAC indicating increased risk in women under 50. BAC remained an independent predictor when analyzed alongside ASCVD risk scores, showing significant associations with myocardial infarction, stroke, heart failure, and mortality (all p<0.005). Automated BAC quantification enables opportunistic cardiovascular risk assessment during routine mammography without additional radiation or cost. This approach provides value beyond traditional risk factors, particularly in younger women, offering potential for early CVD risk stratification in the millions of women undergoing annual mammography.

Submitted 17 March, 2025; originally announced March 2025.

arXiv:2503.13581 [pdf, other]

Subgroup Performance of a Commercial Digital Breast Tomosynthesis Model for Breast Cancer Detection

Authors: Beatrice Brown-Mulry, Rohan Satya Isaac, Sang Hyup Lee, Ambika Seth, KyungJee Min, Theo Dapamede, Frank Li, Aawez Mansuri, MinJae Woo, Christian Allison Fauria-Robinson, Bhavna Paryani, Judy Wawira Gichoya, Hari Trivedi

Abstract: While research has established the potential of AI models for mammography to improve breast cancer screening outcomes, there have not been any detailed subgroup evaluations performed to assess the strengths and weaknesses of commercial models for digital breast tomosynthesis (DBT) imaging. This study presents a granular evaluation of the Lunit INSIGHT DBT model on a large retrospective cohort of 163,449 screening mammography exams from the Emory Breast Imaging Dataset (EMBED). Model performance was evaluated in a binary context with various negative exam types (162,081 exams) compared against screen detected cancers (1,368 exams) as the positive class. The analysis was stratified across demographic, imaging, and pathologic subgroups to identify potential disparities. The model achieved an overall AUC of 0.91 (95% CI: 0.90-0.92) with a precision of 0.08 (95% CI: 0.08-0.08), and a recall of 0.73 (95% CI: 0.71-0.76). Performance was found to be robust across demographics, but cases with non-invasive cancers (AUC: 0.85, 95% CI: 0.83-0.87), calcifications (AUC: 0.80, 95% CI: 0.78-0.82), and dense breast tissue (AUC: 0.90, 95% CI: 0.88-0.91) were associated with significantly lower performance compared to other groups. These results highlight the need for detailed evaluation of model characteristics and vigilance in considering adoption of new tools for clinical deployment.

Submitted 17 March, 2025; originally announced March 2025.

Comments: 14 pages, 7 figures (plus 7 figures in supplement), 3 tables (plus 1 table in supplement)

arXiv:2501.06571 [pdf, other]

Active Rule Mining for Multivariate Anomaly Detection in Radio Access Networks

Authors: Ebenezer R. H. P. Isaac, Joseph H. R. Isaac

Abstract: Multivariate anomaly detection finds its importance in diverse applications. Despite the existence of many detectors to solve this problem, one cannot simply define why an obtained anomaly inferred by the detector is anomalous. This reasoning is required for network operators to understand the root cause of the anomaly and the remedial action that should be taken to counteract its occurrence. Existing solutions in explainable AI may give cues to features that influence an anomaly, but they do not formulate generalizable rules that can be assessed by a domain expert. Furthermore, not all outliers are anomalous in a business sense. There is an unfulfilled need for a system that can interpret anomalies predicted by a multivariate anomaly detector and map these patterns to actionable rules. This paper aims to fulfill this need by proposing a semi-autonomous anomaly rule miner. The proposed method is applicable to both discrete and time series data and is tailored for radio access network (RAN) anomaly detection use cases. The proposed method is demonstrated in this paper with time series RAN data.

Submitted 11 January, 2025; originally announced January 2025.

arXiv:2411.07126 [pdf, other]

Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models

Authors: NVIDIA: Yuval Atzmon, Maciej Bala, Yogesh Balaji, Tiffany Cai, Yin Cui, Jiaojiao Fan, Yunhao Ge, Siddharth Gururani, Jacob Huffman, Ronald Isaac, Pooya Jannaty, Tero Karras, Grace Lam, J. P. Lewis, Aaron Licata, Yen-Chen Lin, Ming-Yu Liu, Qianli Ma, Arun Mallya, Ashlee Martino-Tarr, Doug Mendez, Seungjun Nah, Chris Pruett, et al. (7 additional authors not shown)

Abstract: We introduce Edify Image, a family of diffusion models capable of generating photorealistic image content with pixel-perfect accuracy. Edify Image utilizes cascaded pixel-space diffusion models trained using a novel Laplacian diffusion process, in which image signals at different frequency bands are attenuated at varying rates. Edify Image supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360 HDR panorama generation, and finetuning for image customization.

Submitted 11 November, 2024; originally announced November 2024.

arXiv:2401.12198 [pdf, other]

LONEStar: The Lunar Flashlight Optical Navigation Experiment

Authors: Michael Krause, Ava Thrasher, Priyal Soni, Liam Smego, Reuben Isaac, Jennifer Nolan, Micah Pledger, E. Glenn Lightsey, W. Jud Ready, John Christian

Abstract: This paper documents the results from the highly successful Lunar flashlight Optical Navigation Experiment with a Star tracker (LONEStar). Launched in December 2022, Lunar Flashlight (LF) was a NASA-funded technology demonstration mission. After a propulsion system anomaly prevented capture in lunar orbit, LF was ejected from the Earth-Moon system and into heliocentric space. NASA subsequently transferred ownership of LF to Georgia Tech to conduct an unfunded extended mission to demonstrate further advanced technology objectives, including LONEStar. From August-December 2023, the LONEStar team performed on-orbit calibration of the optical instrument and a number of different OPNAV experiments. This campaign included the processing of nearly 400 images of star fields, Earth and Moon, and four other planets (Mercury, Mars, Jupiter, and Saturn). LONEStar provided the first on-orbit demonstrations of heliocentric navigation using only optical observations of planets. Of special note is the successful in-flight demonstration of (1) instantaneous triangulation with simultaneous sightings of two planets with the LOST algorithm and (2) dynamic triangulation with sequential sightings of multiple planets.

Submitted 22 January, 2024; originally announced January 2024.

arXiv:2308.10504 [pdf, other]

Adaptive Thresholding Heuristic for KPI Anomaly Detection

Authors: Ebenezer R. H. P. Isaac, Akshat Sharma

Abstract: A plethora of outlier detectors have been explored in the time series domain, however, in a business sense, not all outliers are anomalies of interest. Existing anomaly detection solutions are confined to certain outlier detectors limiting their applicability to broader anomaly detection use cases. Network KPIs (Key Performance Indicators) tend to exhibit stochastic behaviour producing statistical outliers, most of which do not adversely affect business operations. Thus, a heuristic is required to capture the business definition of an anomaly for time series KPI. This article proposes an Adaptive Thresholding Heuristic (ATH) to dynamically adjust the detection threshold based on the local properties of the data distribution and adapt to changes in time series patterns. The heuristic derives the threshold based on the expected periodicity and the observed proportion of anomalies minimizing false positives and addressing concept drift. ATH can be used in conjunction with any underlying seasonality decomposition method and an outlier detector that yields an outlier score. This method has been tested on EON1-Cell-U, a labeled KPI anomaly dataset produced by Ericsson, to validate our hypothesis. Experimental results show that ATH is computationally efficient making it scalable for near real time anomaly detection and flexible with multiple forecasters and outlier detectors.

Submitted 21 August, 2023; originally announced August 2023.

arXiv:2306.05989 [pdf, other]

QBSD: Quartile-Based Seasonality Decomposition for Cost-Effective RAN KPI Forecasting

Authors: Ebenezer RHP Isaac, Bulbul Singh

Abstract: Forecasting time series patterns, such as cell key performance indicators (KPIs) of radio access networks (RAN), plays a vital role in enhancing service quality and operational efficiency. State-of-the-art forecasting approaches prioritize accuracy at the expense of computational performance, rendering them less suitable for data-intensive applications encompassing systems with a multitude of time series variables. They also do not capture the effect of dynamic operating ranges that vary with time. To address this issue, we introduce QBSD, a live single-step forecasting approach tailored to optimize the trade-off between accuracy and computational complexity. The method has shown significant success with our real network RAN KPI datasets of over several thousand cells. In this article, we showcase the performance of QBSD in comparison to other forecasting approaches on a dataset we have made publicly available. The results demonstrate that the proposed method excels in runtime efficiency compared to the leading algorithms available while maintaining competitive forecast accuracy that rivals neural forecasting methods.

Submitted 4 November, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

arXiv:2304.11087 [pdf, other]

AI Product Security: A Primer for Developers

Authors: Ebenezer R. H. P. Isaac, Jim Reno

Abstract: Not too long ago, AI security used to mean the research and practice of how AI can empower cybersecurity, that is, AI for security. Ever since Ian Goodfellow and his team popularized adversarial attacks on machine learning, security for AI became an important concern and also part of AI security. It is imperative to understand the threats to machine learning products and avoid common pitfalls in AI product development. This article is addressed to developers, designers, managers and researchers of AI software products.

Submitted 18 April, 2023; originally announced April 2023.

Comments: 10 pages, 1 figure

arXiv:2303.10338 [pdf]

A general-purpose AI assistant embedded in an open-source radiology information system

Authors: Saptarshi Purkayastha, Rohan Isaac, Sharon Anthony, Shikhar Shukla, Elizabeth A. Krupinski, Joshua A. Danish, Judy W. Gichoya

Abstract: Radiology AI models have made significant progress in near-human performance or surpassing it. However, AI model's partnership with human radiologist remains an unexplored challenge due to the lack of health information standards, contextual and workflow differences, and data labeling variations. To overcome these challenges, we integrated an AI model service that uses DICOM standard SR annotations into the OHIF viewer in the open-source LibreHealth Radiology Information Systems (RIS). In this paper, we describe the novel Human-AI partnership capabilities of the platform, including few-shot learning and swarm learning approaches to retrain the AI models continuously. Building on the concept of machine teaching, we developed an active learning strategy within the RIS, so that the human radiologist can enable/disable AI annotations as well as "fix"/relabel the AI annotations. These annotations are then used to retrain the models. This helps establish a partnership between the radiologist user and a user-specific AI model. The weights of these user-specific models are then finally shared between multiple models in a swarm learning approach.

Submitted 18 March, 2023; originally announced March 2023.

Comments: Full research paper version of the demo paper accepted at the AIME 2023 - 21st International Conference of Artificial Intelligence in Medicine

arXiv:2111.06670 [pdf, other]

Robust Analytics for Video-Based Gait Biometrics

Authors: Ebenezer R. H. P. Isaac

Abstract: Gait analysis is the study of the systematic methods that assess and quantify animal locomotion. Gait finds a unique importance among the many state-of-the-art biometric systems since it does not require the subject's cooperation to the extent required by other modalities. Hence by nature, it is an unobtrusive biometric. This thesis discusses both hard and soft biometric characteristics of gait. It shows how to identify gender based on gait alone through the Posed-Based Voting scheme. It then describes improving gait recognition accuracy using Genetic Template Segmentation. Members of a wide population can be authenticated using Multiperson Signature Mapping. Finally, the mapping can be improved in a smaller population using Bayesian Thresholding. All methods proposed in this thesis have outperformed their existing state of the art with adequate experimentation and results.

Submitted 12 November, 2021; originally announced November 2021.

Comments: Ph.D. Thesis, Anna University, Chennai, Feb. 2018

arXiv:1903.10744 [pdf, other]

Trait of Gait: A Survey on Gait Biometrics

Authors: Ebenezer R. H. P. Isaac, Susan Elias, Srinivasan Rajagopalan, K. S. Easwarakumar

Abstract: Gait analysis is the study of the systematic methods that assess and quantify animal locomotion. The research on gait analysis has considerably evolved through time. It was an ancient art, and it still finds its application today in modern science and medicine. This paper describes how one's gait can be used as a biometric. It shall diversely cover salient research done within the field and explain the nuances and advances in each type of gait analysis. The prominent methods of gait recognition from the early era to the state of the art are covered. This survey also reviews the various gait datasets. The overall aim of this study is to provide a concise roadmap for anyone who wishes to do research in the field of gait biometrics.

Submitted 26 March, 2019; originally announced March 2019.

arXiv:1309.0666 [pdf, ps, other]

A proof that the square root of s for s not a perfect square is simply normal to base 2

Authors: Richard Isaac

Abstract: Since E. Borel proved in 1909 that almost all real numbers with respect to Lebesgue measure are normal to all bases, an open problem has been whether simple irrationals like square root of 2 are normal to any base. We show that each number of the form square root of s for s not a perfect square is simply normal to base 2, that is, the averages of the first n digits of its dyadic expansion converge to 1/2. The proof is mostly elementary and self contained but some basic probability is used. The main idea centers on the notion of tails of an expansion, that is, the sequence of digits with index larger than any fixed integer n.

Submitted 19 September, 2018; v1 submitted 3 September, 2013; originally announced September 2013.

Comments: 13 pages, Section 2.3 has been rewritten, with a section on basic probability, allowing arguments to be given at a more advanced level and with more detail. Correction: The two probabilities given in the previous version (pp. 7,8) are not defined on an adequate sample space. Section 2.3.3 replaces it with an appropriate one

MSC Class: 11K16
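
The quantity named in the abstract above, the average of the first n digits of the dyadic expansion of the square root of s, can be computed exactly with integer arithmetic. This is only an illustration of the definition: the function name `binary_digit_average` is hypothetical, it considers the digits after the binary point, and a finite computation says nothing about the limit and has no bearing on the proof.

```python
from math import isqrt


def binary_digit_average(s, n):
    """Average of the first n binary digits after the point of sqrt(s).

    Uses exact integer arithmetic, so there is no floating-point rounding.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    root = isqrt(s)
    if root * root == s:
        raise ValueError("s must not be a perfect square")
    # floor(sqrt(s) * 2**n) equals isqrt(s * 4**n).
    scaled = isqrt(s << (2 * n))
    # Subtracting the integer part leaves floor(frac(sqrt(s)) * 2**n),
    # whose n-bit binary form is the first n fractional digits.
    fractional_bits = scaled - (root << n)
    return bin(fractional_bits).count("1") / n


if __name__ == "__main__":
    for n in (8, 64, 1024, 16384):
        print(n, binary_digit_average(2, n))
```

Simple normality to base 2 is the statement that this average tends to 1/2 as n grows; the script only shows how the running average behaves for a few finite values of n.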

arXiv:1104.1616 [pdf, ps, other]

On a sufficient condition that the square root of s is simply normal to base 2, for s not a perfect square

Authors: Richard Isaac

Abstract: A simple proof is given of a sufficient condition that the square root of s is simply normal to base 2, for s not a perfect square. This relates to previous work of the author.

Submitted 8 April, 2011; originally announced April 2011.

Comments: 9 pages

MSC Class: 11K16

arXiv:math/0512404 [pdf, ps, other]

On the simple normality to base 2 of the square root of s, for s not a perfect square

Authors: Richard Isaac

Abstract: We show that each number of the form, the square root of s for s not a perfect square, is simply normal to the base 2. The argument uses some elementary ideas from the calculus of finite differences.

Submitted 21 September, 2006; v1 submitted 16 December, 2005; originally announced December 2005.

Comments: 14 pages; Lemma 6 of the original version is incorrect. This revision provides an alternative argument to get the desired result. Except for minor modifications, the revision agrees with the original through and including lemma 4. The alternative argument begins in the revision after the conclusion of Lemma 4. Additional modifications include correction of a few typos, rephrasing of some exposition, and the clarification of confusing notation for the partial difference on p. 7

MSC Class: 11K16

Showing 1–16 of 16 results for author: Isaac, R