-
RadGPT: Constructing 3D Image-Text Tumor Datasets
Authors:
Pedro R. A. S. Bassi,
Mehmet Can Yavuz,
Kang Wang,
Xiaoxi Chen,
Wenxuan Li,
Sergio Decherchi,
Andrea Cavalli,
Yang Yang,
Alan Yuille,
Zongwei Zhou
Abstract:
With over 85 million CT scans performed annually in the United States, creating tumor-related reports is a challenging and time-consuming task for radiologists. To address this need, we present RadGPT, an Anatomy-Aware Vision-Language AI Agent for generating detailed reports from CT scans. RadGPT first segments tumors, including benign cysts and malignant tumors, and their surrounding anatomical s…
▽ More
With over 85 million CT scans performed annually in the United States, creating tumor-related reports is a challenging and time-consuming task for radiologists. To address this need, we present RadGPT, an Anatomy-Aware Vision-Language AI Agent for generating detailed reports from CT scans. RadGPT first segments tumors, including benign cysts and malignant tumors, and their surrounding anatomical structures, then transforms this information into both structured reports and narrative reports. These reports provide tumor size, shape, location, attenuation, volume, and interactions with surrounding blood vessels and organs. Extensive evaluation on unseen hospitals shows that RadGPT can produce accurate reports, with high sensitivity/specificity for small tumor (<2 cm) detection: 80/73% for liver tumors, 92/78% for kidney tumors, and 77/77% for pancreatic tumors. For large tumors, sensitivity ranges from 89% to 97%. The results significantly surpass the state-of-the-art in abdominal CT report generation.
RadGPT generated reports for 17 public datasets. Through radiologist review and refinement, we have ensured the reports' accuracy, and created the first publicly available image-text 3D medical dataset, comprising over 1.8 million text tokens and 2.7 million images from 9,262 CT scans, including 2,947 tumor scans/reports of 8,562 tumor instances. Our reports can: (1) localize tumors in eight liver sub-segments and three pancreatic sub-segments annotated per-voxel; (2) determine pancreatic tumor stage (T1-T4) in 260 reports; and (3) present individual analyses of multiple tumors--rare in human-made reports. Importantly, 948 of the reports are for early-stage tumors.
△ Less
Submitted 8 January, 2025;
originally announced January 2025.
-
Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?
Authors:
Pedro R. A. S. Bassi,
Wenxuan Li,
Yucheng Tang,
Fabian Isensee,
Zifu Wang,
Jieneng Chen,
Yu-Cheng Chou,
Yannick Kirchhoff,
Maximilian Rokuss,
Ziyan Huang,
Jin Ye,
Junjun He,
Tassilo Wald,
Constantin Ulrich,
Michael Baumgartner,
Saikat Roy,
Klaus H. Maier-Hein,
Paul Jaeger,
Yiwen Ye,
Yutong Xie,
Jianpeng Zhang,
Ziyang Chen,
Yong Xia,
Zhaohu Xing,
Lei Zhu
, et al. (28 additional authors not shown)
Abstract:
How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone…
▽ More
How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks--which, differing from algorithms, are more flexible and can support different algorithms--including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain.
△ Less
Submitted 19 January, 2025; v1 submitted 6 November, 2024;
originally announced November 2024.
-
Label Critic: Design Data Before Models
Authors:
Pedro R. A. S. Bassi,
Qilong Wu,
Wenxuan Li,
Sergio Decherchi,
Andrea Cavalli,
Alan Yuille,
Zongwei Zhou
Abstract:
As medical datasets rapidly expand, creating detailed annotations of different body structures becomes increasingly expensive and time-consuming. We consider that requesting radiologists to create detailed annotations is unnecessarily burdensome and that pre-existing AI models can largely automate this process. Following the spirit don't use a sledgehammer on a nut, we find that, rather than creat…
▽ More
As medical datasets rapidly expand, creating detailed annotations of different body structures becomes increasingly expensive and time-consuming. We consider that requesting radiologists to create detailed annotations is unnecessarily burdensome and that pre-existing AI models can largely automate this process. Following the spirit don't use a sledgehammer on a nut, we find that, rather than creating annotations from scratch, radiologists only have to review and edit errors if the Best-AI Labels have mistakes. To obtain the Best-AI Labels among multiple AI Labels, we developed an automatic tool, called Label Critic, that can assess label quality through tireless pairwise comparisons. Extensive experiments demonstrate that, when incorporated with our developed Image-Prompt pairs, pre-existing Large Vision-Language Models (LVLM), trained on natural images and texts, achieve 96.5% accuracy when choosing the best label in a pair-wise comparison, without extra fine-tuning. By transforming the manual annotation task (30-60 min/scan) into an automatic comparison task (15 sec/scan), we effectively reduce the manual efforts required from radiologists by an order of magnitude. When the Best-AI Labels are sufficiently accurate (81% depending on body structures), they will be directly adopted as the gold-standard annotations for the dataset, with lower-quality AI Labels automatically discarded. Label Critic can also check the label quality of a single AI Label with 71.8% accuracy when no alternatives are available for comparison, prompting radiologists to review and edit if the estimated quality is low (19% depending on body structures).
△ Less
Submitted 4 November, 2024;
originally announced November 2024.
-
Explanation is All You Need in Distillation: Mitigating Bias and Shortcut Learning
Authors:
Pedro R. A. S. Bassi,
Andrea Cavalli,
Sergio Decherchi
Abstract:
Bias and spurious correlations in data can cause shortcut learning, undermining out-of-distribution (OOD) generalization in deep neural networks. Most methods require unbiased data during training (and/or hyper-parameter tuning) to counteract shortcut learning. Here, we propose the use of explanation distillation to hinder shortcut learning. The technique does not assume any access to unbiased dat…
▽ More
Bias and spurious correlations in data can cause shortcut learning, undermining out-of-distribution (OOD) generalization in deep neural networks. Most methods require unbiased data during training (and/or hyper-parameter tuning) to counteract shortcut learning. Here, we propose the use of explanation distillation to hinder shortcut learning. The technique does not assume any access to unbiased data, and it allows an arbitrarily sized student network to learn the reasons behind the decisions of an unbiased teacher, such as a vision-language model or a network processing debiased images. We found that it is possible to train a neural network with explanation (e.g by Layer Relevance Propagation, LRP) distillation only, and that the technique leads to high resistance to shortcut learning, surpassing group-invariant learning, explanation background minimization, and alternative distillation techniques. In the COLOURED MNIST dataset, LRP distillation achieved 98.2% OOD accuracy, while deep feature distillation and IRM achieved 92.1% and 60.2%, respectively. In COCO-on-Places, the undesirable generalization gap between in-distribution and OOD accuracy is only of 4.4% for LRP distillation, while the other two techniques present gaps of 15.1% and 52.1%, respectively.
△ Less
Submitted 13 July, 2024;
originally announced July 2024.
-
Faster ISNet for Background Bias Mitigation on Deep Neural Networks
Authors:
Pedro R. A. S. Bassi,
Sergio Decherchi,
Andrea Cavalli
Abstract:
Bias or spurious correlations in image backgrounds can impact neural networks, causing shortcut learning (Clever Hans Effect) and hampering generalization to real-world data. ISNet, a recently introduced architecture, proposed the optimization of Layer-Wise Relevance Propagation (LRP, an explanation technique) heatmaps, to mitigate the influence of backgrounds on deep classifiers. However, ISNet's…
▽ More
Bias or spurious correlations in image backgrounds can impact neural networks, causing shortcut learning (Clever Hans Effect) and hampering generalization to real-world data. ISNet, a recently introduced architecture, proposed the optimization of Layer-Wise Relevance Propagation (LRP, an explanation technique) heatmaps, to mitigate the influence of backgrounds on deep classifiers. However, ISNet's training time scales linearly with the number of classes in an application. Here, we propose reformulated architectures whose training time becomes independent from this number. Additionally, we introduce a concise and model-agnostic LRP implementation. We challenge the proposed architectures using synthetic background bias, and COVID-19 detection in chest X-rays, an application that commonly presents background bias. The networks hindered background attention and shortcut learning, surpassing multiple state-of-the-art models on out-of-distribution test datasets. Representing a potentially massive training speed improvement over ISNet, the proposed architectures introduce LRP optimization into a gamut of applications that the original model cannot feasibly handle.
△ Less
Submitted 31 March, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
5Greplay: a 5G Network Traffic Fuzzer -- Application to Attack Injection
Authors:
Zujany Salazar,
Huu Nghia Nguyen,
Wissam Mallouli,
Ana R Cavalli,
Edgardo Montes de Oca
Abstract:
The fifth generation of mobile broadband is more than just an evolution to provide more mobile bandwidth, massive machine-type communications, and ultra-reliable and low-latency communications. It relies on a complex, dynamic and heterogeneous environment that implies addressing numerous testing and security challenges. In this paper we present 5Greplay, an open-source 5G network traffic fuzzer th…
▽ More
The fifth generation of mobile broadband is more than just an evolution to provide more mobile bandwidth, massive machine-type communications, and ultra-reliable and low-latency communications. It relies on a complex, dynamic and heterogeneous environment that implies addressing numerous testing and security challenges. In this paper we present 5Greplay, an open-source 5G network traffic fuzzer that enables the evaluation of 5G components by replaying and modifying 5G network traffic by creating and injecting network scenarios into a target that can be a 5G core service (e.g., AMF, SMF) or a RAN network (e.g., gNodeB). The tool provides the ability to alter network packets online or offline in both control and data planes in a very flexible manner. The experimental evaluation conducted against open-source based 5G platforms, showed that the target services accept traffic being altered by the tool, and that it can reach up to 9.56 Gbps using only 1 processor core to replay 5G traffic.
△ Less
Submitted 12 April, 2023;
originally announced April 2023.
-
A Survey on Cyber-Resilience Approaches for Cyber-Physical Systems
Authors:
Mariana Segovia-Ferreira,
Jose Rubio-Hernan,
Ana Rosa Cavalli,
Joaquin Garcia-Alfaro
Abstract:
Concerns for the resilience of Cyber-Physical Systems (CPS)s in critical infrastructure are growing. CPS integrate sensing, computation, control, and networking into physical objects and mission-critical services, connecting traditional infrastructure to internet technologies. While this integration increases service efficiency, it has to face the possibility of new threats posed by the new functi…
▽ More
Concerns for the resilience of Cyber-Physical Systems (CPS)s in critical infrastructure are growing. CPS integrate sensing, computation, control, and networking into physical objects and mission-critical services, connecting traditional infrastructure to internet technologies. While this integration increases service efficiency, it has to face the possibility of new threats posed by the new functionalities. This leads to cyber-threats, such as denial-of-service, modification of data, information leakage, spreading of malware, and many others. Cyber-resilience refers to the ability of a CPS to prepare, absorb, recover, and adapt to the adverse effects associated with cyber-threats, e.g., physical degradation of the CPS performance resulting from a cyber-attack. Cyber-resilience aims at ensuring CPS survival by keeping the core functionalities of the CPS in case of extreme events. The literature on cyber-resilience is rapidly increasing, leading to a broad variety of research works addressing this new topic. In this article, we create a systematization of knowledge about existing scientific efforts of making CPSs cyber-resilient. We systematically survey recent literature addressing cyber-resilience with a focus on techniques that may be used on CPSs. We first provide preliminaries and background on CPSs and threats, and subsequently survey state-of-the-art approaches that have been proposed by recent research work applicable to CPSs. In particular, we aim at differentiating research work from traditional risk management approaches based on the general acceptance that it is unfeasible to prevent and mitigate all possible risks threatening a CPS. We also discuss questions and research challenges, with a focus on the practical aspects of cyber-resilience, such as the use of metrics and evaluation methods as well as testing and validation environments.
△ Less
Submitted 16 May, 2024; v1 submitted 10 February, 2023;
originally announced February 2023.
-
Improving deep neural network generalization and robustness to background bias via layer-wise relevance propagation optimization
Authors:
Pedro R. A. S. Bassi,
Sergio S. J. Dertkigil,
Andrea Cavalli
Abstract:
Features in images' backgrounds can spuriously correlate with the images' classes, representing background bias. They can influence the classifier's decisions, causing shortcut learning (Clever Hans effect). The phenomenon generates deep neural networks (DNNs) that perform well on standard evaluation datasets but generalize poorly to real-world data. Layer-wise Relevance Propagation (LRP) explains…
▽ More
Features in images' backgrounds can spuriously correlate with the images' classes, representing background bias. They can influence the classifier's decisions, causing shortcut learning (Clever Hans effect). The phenomenon generates deep neural networks (DNNs) that perform well on standard evaluation datasets but generalize poorly to real-world data. Layer-wise Relevance Propagation (LRP) explains DNNs' decisions. Here, we show that the optimization of LRP heatmaps can minimize the background bias influence on deep classifiers, hindering shortcut learning. By not increasing run-time computational cost, the approach is light and fast. Furthermore, it applies to virtually any classification architecture. After injecting synthetic bias in images' backgrounds, we compared our approach (dubbed ISNet) to eight state-of-the-art DNNs, quantitatively demonstrating its superior robustness to background bias. Mixed datasets are common for COVID-19 and tuberculosis classification with chest X-rays, fostering background bias. By focusing on the lungs, the ISNet reduced shortcut learning. Thus, its generalization performance on external (out-of-distribution) test databases significantly surpassed all implemented benchmark models.
△ Less
Submitted 10 January, 2024; v1 submitted 1 February, 2022;
originally announced February 2022.
-
Cyber-Resilience Evaluation of Cyber-Physical Systems
Authors:
Mariana Segovia,
Jose Rubio-Hernan,
Ana Rosa Cavalli,
Joaquin Garcia-Alfaro
Abstract:
Cyber-Physical Systems (CPS) use computational resources to control physical process and provide critical services. For this reason, an attack in these systems may have dangerous consequences in the physical world. Hence, resilience is a fundamental property to ensure the safety of the people, the environment and the controlled physical process. In this paper, we present metrics to quantify the re…
▽ More
Cyber-Physical Systems (CPS) use computational resources to control physical process and provide critical services. For this reason, an attack in these systems may have dangerous consequences in the physical world. Hence, resilience is a fundamental property to ensure the safety of the people, the environment and the controlled physical process. In this paper, we present metrics to quantify the resilience level based on the design, structure, stability, and performance under the attack of a given CPS. The metrics provide reference points to evaluate whether the system is better prepared or not to face the adversaries. This way, it is possible to quantify the ability to recover from an adversary using its mathematical model based on switched linear systems and actuators saturation. Finally, we validate our approach using a numeric simulation on the Tennesse Eastman control challenge problem.
△ Less
Submitted 15 September, 2020;
originally announced September 2020.
-
Testing Security Policies for Distributed Systems: Vehicular Networks as a Case Study
Authors:
Mohamed H. E. Aouadi,
Khalifa Toumi,
Ana Cavalli
Abstract:
Due to the increasing complexity of distributed systems, security testing is becoming increasingly critical in insuring reliability of such systems in relation to their security requirements. . To challenge this issue, we rely in this paper1 on model based active testing. In this paper we propose a framework to specify security policies and test their implementation. Our framework makes it possibl…
▽ More
Due to the increasing complexity of distributed systems, security testing is becoming increasingly critical in insuring reliability of such systems in relation to their security requirements. . To challenge this issue, we rely in this paper1 on model based active testing. In this paper we propose a framework to specify security policies and test their implementation. Our framework makes it possible to automatically generate test sequences, in order to validate the conformance of a security policy. This framework contains several new methods to ease the test case generation. To demonstrate the reliability of our framework, we present a Vehicular Networks System as an ongoing case study.
△ Less
Submitted 17 October, 2014;
originally announced October 2014.