-
Differential Privacy for Deep Learning in Medicine
Authors:
Marziyeh Mohammadi,
Mohsen Vejdanihemmat,
Mahshad Lotfinia,
Mirabela Rusu,
Daniel Truhn,
Andreas Maier,
Soroosh Tayebi Arasteh
Abstract:
Differential privacy (DP) is a key technique for protecting sensitive patient data in medical deep learning (DL). As clinical models grow more data-dependent, balancing privacy with utility and fairness has become a critical challenge. This scoping review synthesizes recent developments in applying DP to medical DL, with a particular focus on DP-SGD and alternative mechanisms across centralized and federated settings. Using a structured search strategy, we identified 74 studies published up to March 2025. Our analysis spans diverse data modalities, training setups, and downstream tasks, and highlights the tradeoffs between privacy guarantees, model accuracy, and subgroup fairness. We find that while DP, especially at strong privacy budgets, can preserve performance in well-structured imaging tasks, severe degradation often occurs under strict privacy, particularly in underrepresented or complex modalities. Furthermore, privacy-induced performance gaps disproportionately affect demographic subgroups, with fairness impacts varying by data type and task. A small subset of studies explicitly addresses these tradeoffs through subgroup analysis or fairness metrics, but most omit them entirely. Beyond DP-SGD, emerging approaches leverage alternative mechanisms, generative models, and hybrid federated designs, though reporting remains inconsistent. We conclude by outlining key gaps in fairness auditing, standardization, and evaluation protocols, offering guidance for future work toward equitable and clinically robust privacy-preserving DL systems in medicine.
Submitted 31 May, 2025;
originally announced June 2025.
-
Perceptual Implications of Automatic Anonymization in Pathological Speech
Authors:
Soroosh Tayebi Arasteh,
Saba Afza,
Tri-Thien Nguyen,
Lukas Buess,
Maryam Parvin,
Tomas Arias-Vergara,
Paula Andrea Perez-Toro,
Hiu Ching Hung,
Mahshad Lotfinia,
Thomas Gorges,
Elmar Noeth,
Maria Schuster,
Seung Hee Yang,
Andreas Maier
Abstract:
Automatic anonymization techniques are essential for ethical sharing of pathological speech data, yet their perceptual consequences remain understudied. This study presents the first comprehensive human-centered analysis of anonymized pathological speech, using a structured perceptual protocol involving ten native and non-native German listeners with diverse linguistic, clinical, and technical backgrounds. Listeners evaluated anonymized-original utterance pairs from 180 speakers spanning Cleft Lip and Palate, Dysarthria, Dysglossia, Dysphonia, and age-matched healthy controls. Speech was anonymized using state-of-the-art automatic methods (equal error rates in the range of 30-40%). Listeners completed Turing-style discrimination and quality rating tasks under zero-shot (single-exposure) and few-shot (repeated-exposure) conditions. Discrimination accuracy was high overall (91% zero-shot; 93% few-shot), but varied by disorder (repeated-measures ANOVA: p=0.007), ranging from 96% (Dysarthria) to 86% (Dysphonia). Anonymization consistently reduced perceived quality (from 83% to 59%, p<0.001), with pathology-specific degradation patterns (one-way ANOVA: p=0.005). Native listeners rated original speech slightly higher than non-native listeners (Delta=4%, p=0.199), but this difference nearly disappeared after anonymization (Delta=1%, p=0.724). No significant gender-based bias was observed. Critically, human perceptual outcomes did not correlate with automatic privacy or clinical utility metrics. These results underscore the need for listener-informed, disorder- and context-specific anonymization strategies that preserve privacy while maintaining interpretability, communicative functions, and diagnostic utility, especially for vulnerable populations such as children.
Submitted 1 May, 2025;
originally announced May 2025.
-
Boosting multi-demographic federated learning for chest radiograph analysis using general-purpose self-supervised representations
Authors:
Mahshad Lotfinia,
Arash Tayebiarasteh,
Samaneh Samiei,
Mehdi Joodaki,
Soroosh Tayebi Arasteh
Abstract:
Reliable artificial intelligence (AI) models for medical image analysis often depend on large and diverse labeled datasets. Federated learning (FL) offers a decentralized and privacy-preserving approach to training but struggles in highly non-independent and identically distributed (non-IID) settings, where institutions with more representative data may experience degraded performance. Moreover, existing large-scale FL studies have been limited to adult datasets, neglecting the unique challenges posed by pediatric data, which introduces additional non-IID variability. To address these limitations, we analyzed n=398,523 adult chest radiographs from diverse institutions across multiple countries and n=9,125 pediatric images, leveraging transfer learning from general-purpose self-supervised image representations to classify pneumonia and cases with no abnormality. Using state-of-the-art vision transformers, we found that FL improved performance only for smaller adult datasets (P<0.001) but degraded performance for larger datasets (P<0.064) and pediatric cases (P=0.242). However, equipping FL with self-supervised weights significantly enhanced outcomes across pediatric cases (P=0.031) and most adult datasets (P<0.008), except the largest dataset (P=0.052). These findings underscore the potential of easily deployable general-purpose self-supervised image representations to address non-IID challenges in clinical FL applications and highlight their promise for enhancing patient outcomes and advancing pediatric healthcare, where data scarcity and variability remain persistent obstacles.
Submitted 19 June, 2025; v1 submitted 11 April, 2025;
originally announced April 2025.
-
Differential privacy enables fair and accurate AI-based analysis of speech disorders while protecting patient data
Authors:
Soroosh Tayebi Arasteh,
Mahshad Lotfinia,
Paula Andrea Perez-Toro,
Tomas Arias-Vergara,
Mahtab Ranji,
Juan Rafael Orozco-Arroyave,
Maria Schuster,
Andreas Maier,
Seung Hee Yang
Abstract:
Speech pathology has impacts on communication abilities and quality of life. While deep learning-based models have shown potential in diagnosing these disorders, the use of sensitive data raises critical privacy concerns. Although differential privacy (DP) has been explored in the medical imaging domain, its application in pathological speech analysis remains largely unexplored despite the equally critical privacy concerns. To the best of our knowledge, this study is the first to investigate DP's impact on pathological speech data, focusing on the trade-offs between privacy, diagnostic accuracy, and fairness. Using a large, real-world dataset of 200 hours of recordings from 2,839 German-speaking participants, we observed a maximum accuracy reduction of 3.85% when training with DP with high privacy levels. To highlight real-world privacy risks, we demonstrated the vulnerability of non-private models to gradient inversion attacks, reconstructing identifiable speech samples and showcasing DP's effectiveness in mitigating these risks. To explore the potential generalizability across languages and disorders, we validated our approach on a dataset of Spanish-speaking Parkinson's disease patients, leveraging pretrained models from healthy English-speaking datasets, and demonstrated that careful pretraining on large-scale task-specific datasets can maintain favorable accuracy under DP constraints. A comprehensive fairness analysis revealed minimal gender bias at reasonable privacy levels but underscored the need for addressing age-related disparities. Our results establish that DP can balance privacy and utility in speech disorder detection, while highlighting unique challenges in privacy-fairness trade-offs for speech data. This provides a foundation for refining DP methodologies and improving fairness across diverse patient groups in real-world deployments.
Submitted 30 May, 2025; v1 submitted 27 September, 2024;
originally announced September 2024.
-
RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering
Authors:
Soroosh Tayebi Arasteh,
Mahshad Lotfinia,
Keno Bressem,
Robert Siepmann,
Lisa Adams,
Dyke Ferber,
Christiane Kuhl,
Jakob Nikolas Kather,
Sven Nebelung,
Daniel Truhn
Abstract:
Large language models (LLMs) often generate outdated or inaccurate information based on static training datasets. Retrieval-augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG), an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions with reference standard answers, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG in a zero-shot inference scenario. RadioRAG retrieved context-specific information from Radiopaedia in real-time. Accuracy was investigated. Statistical analyses were performed using bootstrapping. The results were further compared with human performance. RadioRAG improved diagnostic accuracy across most LLMs, with relative accuracy increases ranging up to 54% for different LLMs. It matched or exceeded non-RAG models and the human radiologist in question answering across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, the degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in RadioRAG's effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. RadioRAG shows potential to improve LLM accuracy and factuality in radiology question answering by integrating real-time domain-specific data.
Submitted 18 June, 2025; v1 submitted 22 July, 2024;
originally announced July 2024.
-
Large Language Models Streamline Automated Machine Learning for Clinical Studies
Authors:
Soroosh Tayebi Arasteh,
Tianyu Han,
Mahshad Lotfinia,
Christiane Kuhl,
Jakob Nikolas Kather,
Daniel Truhn,
Sven Nebelung
Abstract:
A knowledge gap persists between machine learning (ML) developers (e.g., data scientists) and practitioners (e.g., clinicians), hampering the full utilization of ML for clinical data analysis. We investigated the potential of the ChatGPT Advanced Data Analysis (ADA), an extension of GPT-4, to bridge this gap and perform ML analyses efficiently. Real-world clinical datasets and study details from large trials across various medical specialties were presented to ChatGPT ADA without specific guidance. ChatGPT ADA autonomously developed state-of-the-art ML models based on the original study's training data to predict clinical outcomes such as cancer development, cancer progression, disease complications, or biomarkers such as pathogenic gene sequences. Following the re-implementation and optimization of the published models, the head-to-head comparison of the ChatGPT ADA-crafted ML models and their respective manually crafted counterparts revealed no significant differences in traditional performance metrics (P>0.071). Strikingly, the ChatGPT ADA-crafted ML models often outperformed their counterparts. In conclusion, ChatGPT ADA offers a promising avenue to democratize ML in medicine by simplifying complex data analyses, yet should enhance, not replace, specialized training and resources, to promote broader applications in medical research and practice.
Submitted 21 February, 2024; v1 submitted 27 August, 2023;
originally announced August 2023.
-
Preserving privacy in domain transfer of medical AI models comes at no performance costs: The integral role of differential privacy
Authors:
Soroosh Tayebi Arasteh,
Mahshad Lotfinia,
Teresa Nolte,
Marwin Saehn,
Peter Isfort,
Christiane Kuhl,
Sven Nebelung,
Georgios Kaissis,
Daniel Truhn
Abstract:
Developing robust and effective artificial intelligence (AI) models in medicine requires access to large amounts of patient data. The use of AI models solely trained on large multi-institutional datasets can help with this, yet the imperative to ensure data privacy remains, particularly as membership inference risks breaching patient confidentiality. As a proposed remedy, we advocate for the integration of differential privacy (DP). We specifically investigate the performance of models trained with DP as compared to models trained without DP on data from institutions that the model had not seen during its training (i.e., external validation) - the situation that is reflective of the clinical use of AI models. By leveraging more than 590,000 chest radiographs from five institutions, we evaluated the efficacy of DP-enhanced domain transfer (DP-DT) in diagnosing cardiomegaly, pleural effusion, pneumonia, atelectasis, and in identifying healthy subjects. We juxtaposed DP-DT with non-DP-DT and examined diagnostic accuracy and demographic fairness using the area under the receiver operating characteristic curve (AUC) as the main metric, as well as accuracy, sensitivity, and specificity. Our results show that DP-DT, even with exceptionally high privacy levels (epsilon around 1), performs comparably to non-DP-DT (P>0.119 across all domains). Furthermore, DP-DT led to marginal AUC differences - less than 1% - for nearly all subgroups, relative to non-DP-DT. Despite consistent evidence suggesting that DP models induce significant performance degradation for on-domain applications, we show that off-domain performance is almost not affected. Therefore, we ardently advocate for the adoption of DP in training diagnostic medical AI models, given its minimal impact on performance.
Submitted 7 December, 2023; v1 submitted 10 June, 2023;
originally announced June 2023.
-
How Will Your Tweet Be Received? Predicting the Sentiment Polarity of Tweet Replies
Authors:
Soroosh Tayebi Arasteh,
Mehrpad Monajem,
Vincent Christlein,
Philipp Heinrich,
Anguelos Nicolaou,
Hamidreza Naderi Boldaji,
Mahshad Lotfinia,
Stefan Evert
Abstract:
Twitter sentiment analysis, which often focuses on predicting the polarity of tweets, has attracted increasing attention over the last years, in particular with the rise of deep learning (DL). In this paper, we propose a new task: predicting the predominant sentiment among (first-order) replies to a given tweet. Therefore, we created RETWEET, a large dataset of tweets and replies manually annotated with sentiment labels. As a strong baseline, we propose a two-stage DL-based method: first, we create automatically labeled training data by applying a standard sentiment classifier to tweet replies and aggregating its predictions for each original tweet; our rationale is that individual errors made by the classifier are likely to cancel out in the aggregation step. Second, we use the automatically labeled data for supervised training of a neural network to predict reply sentiment from the original tweets. The resulting classifier is evaluated on the new RETWEET dataset, showing promising results, especially considering that it has been trained without any manually labeled data. Both the dataset and the baseline implementation are publicly available.
Submitted 21 April, 2021;
originally announced April 2021.
-
Machine Learning-Based Generalized Model for Finite Element Analysis of Roll Deflection During the Austenitic Stainless Steel 316L Strip Rolling
Authors:
Mahshad Lotfinia,
Soroosh Tayebi Arasteh
Abstract:
During the strip rolling process, a considerable share of the forces from the material pressure causes elastic deformation of the work-roll, i.e., deflection. Uncontrolled work-roll deflection leads to large deviations from the permissible thickness of the plate along its width. In Austenitic Stainless Steels (ASS), the Austenite phase is unstable at cold temperatures, so cold deformation produces Strain-Induced Martensite (SIM), which improves the mechanical properties. This hardens the ASS 316L during cold deformation and makes its stress-strain curve non-linear, which distinguishes it from other categories of steels. To account for this phenomenon, we propose to use a Machine Learning (ML) method to predict the flow stress of the ASS 316L during cold rolling more accurately. Furthermore, we conduct various mechanical tensile tests to obtain the required dataset, Stress316L, for training the neural network. Moreover, instead of using a constant value of flow stress during the multi-pass rolling process, we use a Finite Difference (FD) formulation of the equilibrium equation to account for the dynamic behavior of the flow stress, which yields an estimate of the mean pressure that the strip exerts on the rolls during deformation. Finally, the deflection of the work-roll tools is calculated using Finite Element Analysis (FEA). As a result, we obtain a generalized model for the calculation of the roll deflection, specific to the ASS 316L. To the best of our knowledge, this is the first model for ASS 316L that considers dynamic flow stress and SIM of the rolled plate, using FEM and an ML approach, which could contribute to the better design of the tools.
Submitted 24 April, 2022; v1 submitted 4 February, 2021;
originally announced February 2021.