-
Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare
Authors:
Lovedeep Gondara,
Jonathan Simkin,
Graham Sayle,
Shebnum Devji,
Gregory Arbour,
Raymond Ng
Abstract:
This study aims to guide language model selection by investigating: 1) the necessity of finetuning versus zero-shot usage, 2) the benefits of domain-adjacent versus generic pretrained models, 3) the value of further domain-specific pretraining, and 4) the continued relevance of Small Language Models (SLMs) compared to Large Language Models (LLMs) for specific tasks. Using electronic pathology repo…
▽ More
This study aims to guide language model selection by investigating: 1) the necessity of finetuning versus zero-shot usage, 2) the benefits of domain-adjacent versus generic pretrained models, 3) the value of further domain-specific pretraining, and 4) the continued relevance of Small Language Models (SLMs) compared to Large Language Models (LLMs) for specific tasks. Using electronic pathology reports from the British Columbia Cancer Registry (BCCR), three classification scenarios with varying difficulty and data size are evaluated. Models include various SLMs and an LLM. SLMs are evaluated both zero-shot and finetuned; the LLM is evaluated zero-shot only. Finetuning significantly improved SLM performance across all scenarios compared to their zero-shot results. The zero-shot LLM outperformed zero-shot SLMs but was consistently outperformed by finetuned SLMs. Domain-adjacent SLMs generally performed better than the generic SLM after finetuning, especially on harder tasks. Further domain-specific pretraining yielded modest gains on easier tasks but significant improvements on the complex, data-scarce task. The results highlight the critical role of finetuning for SLMs in specialized domains, enabling them to surpass zero-shot LLM performance on targeted classification tasks. Pretraining on domain-adjacent or domain-specific data provides further advantages, particularly for complex problems or limited finetuning data. While LLMs offer strong zero-shot capabilities, their performance on these specific tasks did not match that of appropriately finetuned SLMs. In the era of LLMs, SLMs remain relevant and effective, offering a potentially superior performance-resource trade-off compared to LLMs.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
Leveraging Language Models for Automated Patient Record Linkage
Authors:
Mohammad Beheshti,
Lovedeep Gondara,
Iris Zachary
Abstract:
Objective: Healthcare data fragmentation presents a major challenge for linking patient data, necessitating robust record linkage to integrate patient records from diverse sources. This study investigates the feasibility of leveraging language models for automated patient record linkage, focusing on two key tasks: blocking and matching. Materials and Methods: We utilized real-world healthcare data…
▽ More
Objective: Healthcare data fragmentation presents a major challenge for linking patient data, necessitating robust record linkage to integrate patient records from diverse sources. This study investigates the feasibility of leveraging language models for automated patient record linkage, focusing on two key tasks: blocking and matching. Materials and Methods: We utilized real-world healthcare data from the Missouri Cancer Registry and Research Center, linking patient records from two independent sources using probabilistic linkage as a baseline. A transformer-based model, RoBERTa, was fine-tuned for blocking using sentence embeddings. For matching, several language models were experimented under fine-tuned and zero-shot settings, assessing their performance against ground truth labels. Results: The fine-tuned blocking model achieved a 92% reduction in the number of candidate pairs while maintaining near-perfect recall. In the matching task, fine-tuned Mistral-7B achieved the best performance with only 6 incorrect predictions. Among zero-shot models, Mistral-Small-24B performed best, with a total of 55 incorrect predictions. Discussion: Fine-tuned language models achieved strong performance in patient record blocking and matching with minimal errors. However, they remain less accurate and efficient than a hybrid rule-based and probabilistic approach for blocking. Additionally, reasoning models like DeepSeek-R1 are impractical for large-scale record linkage due to high computational costs. Conclusion: This study highlights the potential of language models for automating patient record linkage, offering improved efficiency by eliminating the manual efforts required to perform patient record linkage. Overall, language models offer a scalable solution that can enhance data integration, reduce manual effort, and support disease surveillance and research.
△ Less
Submitted 21 April, 2025;
originally announced April 2025.
-
ELM: Ensemble of Language Models for Predicting Tumor Group from Pathology Reports
Authors:
Lovedeep Gondara,
Jonathan Simkin,
Shebnum Devji,
Gregory Arbour,
Raymond Ng
Abstract:
Population-based cancer registries (PBCRs) face a significant bottleneck in manually extracting data from unstructured pathology reports, a process crucial for tasks like tumor group assignment, which can consume 900 person-hours for approximately 100,000 reports. To address this, we introduce ELM (Ensemble of Language Models), a novel ensemble-based approach leveraging both small language models…
▽ More
Population-based cancer registries (PBCRs) face a significant bottleneck in manually extracting data from unstructured pathology reports, a process crucial for tasks like tumor group assignment, which can consume 900 person-hours for approximately 100,000 reports. To address this, we introduce ELM (Ensemble of Language Models), a novel ensemble-based approach leveraging both small language models (SLMs) and large language models (LLMs). ELM utilizes six fine-tuned SLMs, where three SLMs use the top part of the pathology report and three SLMs use the bottom part. This is done to maximize report coverage. ELM requires five-out-of-six agreement for a tumor group classification. Disagreements are arbitrated by an LLM with a carefully curated prompt. Our evaluation across nineteen tumor groups demonstrates ELM achieves an average precision and recall of 0.94, outperforming single-model and ensemble-without-LLM approaches. Deployed at the British Columbia Cancer Registry, ELM demonstrates how LLMs can be successfully applied in a PBCR setting to achieve state-of-the-art results and significantly enhance operational efficiencies, saving hundreds of person-hours annually.
△ Less
Submitted 24 March, 2025;
originally announced March 2025.
-
A Clinical Trial Design Approach to Auditing Language Models in Healthcare Setting
Authors:
Lovedeep Gondara,
Jonathan Simkin
Abstract:
We present an audit mechanism for language models, with a focus on models deployed in the healthcare setting. Our proposed mechanism takes inspiration from clinical trial design where we posit the language model audit as a single blind equivalence trial, with the comparison of interest being the subject matter experts. We show that using our proposed method, we can follow principled sample size an…
▽ More
We present an audit mechanism for language models, with a focus on models deployed in the healthcare setting. Our proposed mechanism takes inspiration from clinical trial design where we posit the language model audit as a single blind equivalence trial, with the comparison of interest being the subject matter experts. We show that using our proposed method, we can follow principled sample size and power calculations, leading to the requirement of sampling minimum number of records while maintaining the audit integrity and statistical soundness. Finally, we provide a real-world example of the audit used in a production environment in a large-scale public health network.
△ Less
Submitted 18 December, 2024; v1 submitted 11 November, 2024;
originally announced November 2024.
-
Differentially Private Ensemble Classifiers for Data Streams
Authors:
Lovedeep Gondara,
Ke Wang,
Ricardo Silva Carvalho
Abstract:
Learning from continuous data streams via classification/regression is prevalent in many domains. Adapting to evolving data characteristics (concept drift) while protecting data owners' private information is an open challenge. We present a differentially private ensemble solution to this problem with two distinguishing features: it allows an \textit{unbounded} number of ensemble updates to deal w…
▽ More
Learning from continuous data streams via classification/regression is prevalent in many domains. Adapting to evolving data characteristics (concept drift) while protecting data owners' private information is an open challenge. We present a differentially private ensemble solution to this problem with two distinguishing features: it allows an \textit{unbounded} number of ensemble updates to deal with the potentially never-ending data streams under a fixed privacy budget, and it is \textit{model agnostic}, in that it treats any pre-trained differentially private classification/regression model as a black-box. Our method outperforms competitors on real-world and simulated datasets for varying settings of privacy, concept drift, and data distribution.
△ Less
Submitted 8 December, 2021;
originally announced December 2021.
-
The Differentially Private Lottery Ticket Mechanism
Authors:
Lovedeep Gondara,
Ke Wang,
Ricardo Silva Carvalho
Abstract:
We propose the differentially private lottery ticket mechanism (DPLTM). An end-to-end differentially private training paradigm based on the lottery ticket hypothesis. Using "high-quality winners", selected via our custom score function, DPLTM significantly improves the privacy-utility trade-off over the state-of-the-art. We show that DPLTM converges faster, allowing for early stopping with reduced…
▽ More
We propose the differentially private lottery ticket mechanism (DPLTM). An end-to-end differentially private training paradigm based on the lottery ticket hypothesis. Using "high-quality winners", selected via our custom score function, DPLTM significantly improves the privacy-utility trade-off over the state-of-the-art. We show that DPLTM converges faster, allowing for early stopping with reduced privacy budget consumption. We further show that the tickets from DPLTM are transferable across datasets, domains, and architectures. Our extensive evaluation on several public datasets provides evidence to our claims.
△ Less
Submitted 16 February, 2020;
originally announced February 2020.
-
Differentially Private Survival Function Estimation
Authors:
Lovedeep Gondara,
Ke Wang
Abstract:
Survival function estimation is used in many disciplines, but it is most common in medical analytics in the form of the Kaplan-Meier estimator. Sensitive data (patient records) is used in the estimation without any explicit control on the information leakage, which is a significant privacy concern. We propose a first differentially private estimator of the survival function and show that it can be…
▽ More
Survival function estimation is used in many disciplines, but it is most common in medical analytics in the form of the Kaplan-Meier estimator. Sensitive data (patient records) is used in the estimation without any explicit control on the information leakage, which is a significant privacy concern. We propose a first differentially private estimator of the survival function and show that it can be easily extended to provide differentially private confidence intervals and test statistics without spending any extra privacy budget. We further provide extensions for differentially private estimation of the competing risk cumulative incidence function, Nelson-Aalen's estimator for the hazard function, etc. Using eleven real-life clinical datasets, we provide empirical evidence that our proposed method provides good utility while simultaneously providing strong privacy guarantees.
△ Less
Submitted 14 January, 2020; v1 submitted 4 October, 2019;
originally announced October 2019.
-
Recovering Loss to Followup Information Using Denoising Autoencoders
Authors:
Lovedeep Gondara,
Ke Wang
Abstract:
Loss to followup is a significant issue in healthcare and has serious consequences for a study's validity and cost. Methods available at present for recovering loss to followup information are restricted by their expressive capabilities and struggle to model highly non-linear relations and complex interactions. In this paper we propose a model based on overcomplete denoising autoencoders to recove…
▽ More
Loss to followup is a significant issue in healthcare and has serious consequences for a study's validity and cost. Methods available at present for recovering loss to followup information are restricted by their expressive capabilities and struggle to model highly non-linear relations and complex interactions. In this paper we propose a model based on overcomplete denoising autoencoders to recover loss to followup information. Designed to work with high volume data, results on various simulated and real life datasets show our model is appropriate under varying dataset and loss to followup conditions and outperforms the state-of-the-art methods by a wide margin ($\ge 20\%$ in some scenarios) while preserving the dataset utility for final analysis.
△ Less
Submitted 11 February, 2018;
originally announced February 2018.
-
MIDA: Multiple Imputation using Denoising Autoencoders
Authors:
Lovedeep Gondara,
Ke Wang
Abstract:
Missing data is a significant problem impacting all domains. State-of-the-art framework for minimizing missing data bias is multiple imputation, for which the choice of an imputation model remains nontrivial. We propose a multiple imputation model based on overcomplete deep denoising autoencoders. Our proposed model is capable of handling different data types, missingness patterns, missingness pro…
▽ More
Missing data is a significant problem impacting all domains. State-of-the-art framework for minimizing missing data bias is multiple imputation, for which the choice of an imputation model remains nontrivial. We propose a multiple imputation model based on overcomplete deep denoising autoencoders. Our proposed model is capable of handling different data types, missingness patterns, missingness proportions and distributions. Evaluation on several real life datasets show our proposed model significantly outperforms current state-of-the-art methods under varying conditions while simultaneously improving end of the line analytics.
△ Less
Submitted 17 February, 2018; v1 submitted 8 May, 2017;
originally announced May 2017.
-
Detecting Adversarial Samples Using Density Ratio Estimates
Authors:
Lovedeep Gondara
Abstract:
Machine learning models, especially based on deep architectures are used in everyday applications ranging from self driving cars to medical diagnostics. It has been shown that such models are dangerously susceptible to adversarial samples, indistinguishable from real samples to human eye, adversarial samples lead to incorrect classifications with high confidence. Impact of adversarial samples is f…
▽ More
Machine learning models, especially based on deep architectures are used in everyday applications ranging from self driving cars to medical diagnostics. It has been shown that such models are dangerously susceptible to adversarial samples, indistinguishable from real samples to human eye, adversarial samples lead to incorrect classifications with high confidence. Impact of adversarial samples is far-reaching and their efficient detection remains an open problem. We propose to use direct density ratio estimation as an efficient model agnostic measure to detect adversarial samples. Our proposed method works equally well with single and multi-channel samples, and with different adversarial sample generation methods. We also propose a method to use density ratio estimates for generating adversarial samples with an added constraint of preserving density ratio.
△ Less
Submitted 20 November, 2017; v1 submitted 5 May, 2017;
originally announced May 2017.
-
CrowdMI: Multiple Imputation via Crowdsourcing
Authors:
Lovedeep Gondara
Abstract:
Can humans impute missing data with similar proficiency as machines? This is the question we aim to answer in this paper. We present a novel idea of converting observations with missing data in to a survey questionnaire, which is presented to crowdworkers for completion. We replicate a multiple imputation framework by having multiple unique crowdworkers complete our questionnaire. Experimental res…
▽ More
Can humans impute missing data with similar proficiency as machines? This is the question we aim to answer in this paper. We present a novel idea of converting observations with missing data in to a survey questionnaire, which is presented to crowdworkers for completion. We replicate a multiple imputation framework by having multiple unique crowdworkers complete our questionnaire. Experimental results demonstrate that using our method, it is possible to generate valid imputations for qualitative and quantitative missing data, with results comparable to imputations generated by complex statistical models.
△ Less
Submitted 23 February, 2018; v1 submitted 8 December, 2016;
originally announced December 2016.
-
Classifier comparison using precision
Authors:
Lovedeep Gondara
Abstract:
New proposed models are often compared to state-of-the-art using statistical significance testing. Literature is scarce for classifier comparison using metrics other than accuracy. We present a survey of statistical methods that can be used for classifier comparison using precision, accounting for inter-precision correlation arising from use of same dataset. Comparisons are made using per-class pr…
▽ More
New proposed models are often compared to state-of-the-art using statistical significance testing. Literature is scarce for classifier comparison using metrics other than accuracy. We present a survey of statistical methods that can be used for classifier comparison using precision, accounting for inter-precision correlation arising from use of same dataset. Comparisons are made using per-class precision and methods presented to test global null hypothesis of an overall model comparison. Comparisons are extended to multiple multi-class classifiers and to models using cross validation or its variants. Partial Bayesian update to precision is introduced when population prevalence of a class is known. Applications to compare deep architectures are studied.
△ Less
Submitted 15 November, 2016; v1 submitted 29 September, 2016;
originally announced September 2016.
-
Medical image denoising using convolutional denoising autoencoders
Authors:
Lovedeep Gondara
Abstract:
Image denoising is an important pre-processing step in medical image analysis. Different algorithms have been proposed in past three decades with varying denoising performances. More recently, having outperformed all conventional methods, deep learning based models have shown a great promise. These methods are however limited for requirement of large training sample size and high computational cos…
▽ More
Image denoising is an important pre-processing step in medical image analysis. Different algorithms have been proposed in past three decades with varying denoising performances. More recently, having outperformed all conventional methods, deep learning based models have shown a great promise. These methods are however limited for requirement of large training sample size and high computational costs. In this paper we show that using small sample size, denoising autoencoders constructed using convolutional layers can be used for efficient denoising of medical images. Heterogeneous images can be combined to boost sample size for increased denoising performance. Simplest of networks can reconstruct images with corruption levels so high that noise and signal are not differentiable to human eye.
△ Less
Submitted 17 September, 2016; v1 submitted 16 August, 2016;
originally announced August 2016.