
Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models

Authors: Hao Zhou, Guergana Savova, Lijing Wang

Abstract: The impact of random seeds in fine-tuning large language models (LLMs) has been largely overlooked despite its potential influence on model performance. In this study, we systematically evaluate the effects of random seeds on LLMs using the GLUE and SuperGLUE benchmarks. We analyze the macro-level impact through traditional metrics like accuracy and F1, calculating their mean and variance to quantify performance fluctuations. To capture the micro-level effects, we introduce a novel metric, consistency, measuring the stability of individual predictions across runs. Our experiments reveal significant variance at both macro and micro levels, underscoring the need for careful consideration of random seeds in fine-tuning and evaluation.

Submitted 10 March, 2025; originally announced March 2025.

Comments: 7 pages, 5 tables, 3 figures

arXiv:2502.10388

Aspect-Oriented Summarization for Psychiatric Short-Term Readmission Prediction

Authors: WonJin Yoon, Boyu Ren, Spencer Thomas, Chanwhi Kim, Guergana Savova, Mei-Hua Hall, Timothy Miller

Abstract: Recent progress in large language models (LLMs) has enabled the automated processing of lengthy documents even without supervised training on a task-specific dataset. Yet, their zero-shot performance in complex tasks as opposed to straightforward information extraction tasks remains suboptimal. One feasible approach for tasks with lengthy, complex input is to first summarize the document and then apply supervised fine-tuning to the summary. However, the summarization process inevitably results in some loss of information. In this study we present a method for processing the summaries of long documents that aims to capture different important aspects of the original document. We hypothesize that LLM summaries generated with different aspect-oriented prompts contain different information signals, and we propose methods to measure these differences. We introduce approaches to effectively integrate signals from these different summaries for supervised training of transformer models. We validate our hypotheses on a high-impact task -- 30-day readmission prediction from a psychiatric discharge -- using real-world data from four hospitals, and show that our proposed method increases the prediction performance for the complex task of predicting patient outcome.

Submitted 14 February, 2025; originally announced February 2025.

arXiv:2410.12774

Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information

Authors: Yingya Li, Timothy Miller, Steven Bethard, Guergana Savova

Abstract: The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks or a random set of tasks can result in negative transfer, with the multi-task models performing worse than single-task models. Though many efforts have been made to identify task groupings and to measure the relatedness among different tasks, it remains a challenging research topic to define a metric to identify the best task grouping out of a pool of many potential task combinations. We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric to estimate how much usable information a dataset contains given a model. We hypothesize that tasks whose PVI estimates are not statistically different are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama 2 and GPT-4. The results show that by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.

Submitted 16 October, 2024; originally announced October 2024.

Comments: main paper 12 pages, Appendix 7 pages, 1 figure, 18 tables

arXiv:2410.12722

WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation

Authors: João Matos, Shan Chen, Siena Placino, Yingya Li, Juan Carlos Climent Pardo, Daphna Idan, Takeshi Tohyama, David Restrepo, Luis F. Nakayama, Jose M. M. Pascual-Leone, Guergana Savova, Hugo Aerts, Leo A. Celi, A. Ian Wong, Danielle S. Bitterman, Jack Gallifant

Abstract: Multimodal/vision language models (VLMs) are increasingly being deployed in healthcare settings worldwide, necessitating robust benchmarks to ensure their safety, efficacy, and fairness. Multiple-choice question and answer (QA) datasets derived from national medical examinations have long served as valuable evaluation tools, but existing datasets are largely text-only and available in a limited subset of languages and countries. To address these challenges, we present WorldMedQA-V, an updated multilingual, multimodal benchmarking dataset designed to evaluate VLMs in healthcare. WorldMedQA-V includes 568 labeled multiple-choice QAs paired with 568 medical images from four countries (Brazil, Israel, Japan, and Spain), covering original languages and validated English translations by native clinicians, respectively. Baseline performance for common open- and closed-source models is provided in the local language and English translations, and with and without images provided to the model. The WorldMedQA-V benchmark aims to better match AI systems to the diverse healthcare environments in which they are deployed, fostering more equitable, effective, and representative applications.

Submitted 16 October, 2024; originally announced October 2024.

Comments: submitted for review, total of 14 pages

arXiv:2410.09937

Artificial Intelligence in the Legal Field: Law Students Perspective

Authors: Daniela Andreeva, Guergana Savova

Abstract: The Artificial Intelligence field, or AI, experienced a renaissance in the last few years across various fields such as law, medicine, and finance. While there are studies outlining the landscape of AI in the legal field as well as surveys of the current AI efforts of law firms, to our knowledge there has not been an investigation of the intersection of law students and AI. Such research is critical to help ensure current law students are positioned to fully exploit this technology as they embark on their legal careers but also to assist existing legal firms to better leverage their AI skillset both operationally and in helping to formulate future legal frameworks for regulating this technology across industries. The study presented in this paper addresses this gap. Through a survey conducted from July 22 to Aug 19, 2024, the study covers the law students' background, AI usage, AI applications in the legal field, AI regulations and open-ended comments to share opinions. The results from this study show the uniqueness of law students as a distinct cohort. The results differ from the ones of established law firms especially in AI engagement - established legal professionals are more engaged than law students. Somewhat surprisingly, the law firm participants show higher enthusiasm about AI than this student cohort.
Collaborations with Computer Science departments would further enhance the AI knowledge and experience of law students in AI technologies such as prompt engineering (zero and few shot), chain-of-thought prompting, and language model hallucination management. As future work, we would like to expand the study to include more variables and a larger cohort more evenly distributed across locales. In addition, it would be insightful to repeat the study with the current cohort in one year to track how the students' viewpoints evolve.

Submitted 13 October, 2024; originally announced October 2024.

Comments: main paper 11 pages, Appendix 5 pages, 1 table

arXiv:2405.09153

Adapting Abstract Meaning Representation Parsing to the Clinical Narrative -- the SPRING THYME parser

Authors: Jon Z. Cai, Kristin Wright-Bettner, Martha Palmer, Guergana K. Savova, James H. Martin

Abstract: This paper is dedicated to the design and evaluation of the first AMR parser tailored for clinical notes. Our objective was to facilitate the precise transformation of the clinical notes into structured AMR expressions, thereby enhancing the interpretability and usability of clinical text data at scale. Leveraging the colon cancer dataset from the Temporal Histories of Your Medical Events (THYME) corpus, we adapted a state-of-the-art AMR parser utilizing continuous training. Our approach incorporates data augmentation techniques to enhance the accuracy of AMR structure predictions. Notably, through this learning strategy, our parser achieved an impressive F1 score of 88% on the THYME corpus's colon cancer dataset. Moreover, our research delved into the efficacy of data required for domain adaptation within the realm of clinical notes, presenting domain adaptation data requirements for AMR parsing. This exploration not only underscores the parser's robust performance but also highlights its potential in facilitating a deeper understanding of clinical narratives through structured semantic representations.

Submitted 15 May, 2024; originally announced May 2024.

Comments: Accepted to the 6th Clinical NLP Workshop at NAACL, 2024

arXiv:2310.17703

The impact of responding to patient messages with large language model assistance

Authors: Shan Chen, Marco Guevara, Shalini Moningi, Frank Hoebers, Hesham Elhalawani, Benjamin H. Kann, Fallon E. Chipidza, Jonathan Leeman, Hugo J. W. L. Aerts, Timothy Miller, Guergana K. Savova, Raymond H. Mak, Maryam Lustberg, Majid Afshar, Danielle S. Bitterman

Abstract: Documentation burden is a major contributor to clinician burnout, which is rising nationally and is an urgent threat to our ability to care for patients. Artificial intelligence (AI) chatbots, such as ChatGPT, could reduce clinician burden by assisting with documentation. Although many hospitals are actively integrating such systems into electronic medical record systems, AI chatbots' utility and impact on clinical decision-making have not been studied for this intended use. We are the first to examine the utility of large language models in assisting clinicians in drafting responses to patient questions. In our two-stage cross-sectional study, 6 oncologists responded to 100 realistic synthetic cancer patient scenarios and portal messages developed to reflect common medical situations, first manually, then with AI assistance. We find AI-assisted responses were longer and less readable, but provided acceptable drafts without edits 58% of the time. AI assistance improved efficiency 77% of the time, with low harm risk (82% safe). However, 7.7% of unedited AI responses could cause severe harm. In 31% of cases, physicians thought AI drafts were human-written. AI assistance led to more patient education recommendations and fewer clinical actions than manual responses. Results show promise for AI to improve clinician efficiency and patient care through assisting documentation, if used judiciously. Monitoring model outputs and human-AI interaction remains crucial for safe implementation.

Submitted 29 November, 2023; v1 submitted 26 October, 2023; originally announced October 2023.

Comments: 4 figures and tables in main, submitted for review

arXiv:2310.12300

Measuring Pointwise $\mathcal{V}$-Usable Information In-Context-ly

Authors: Sheng Lu, Shan Chen, Yingya Li, Danielle Bitterman, Guergana Savova, Iryna Gurevych

Abstract: In-context learning (ICL) is a new learning paradigm that has gained popularity along with the development of large language models. In this work, we adapt a recently proposed hardness metric, pointwise $\mathcal{V}$-usable information (PVI), to an in-context version (in-context PVI). Compared to the original PVI, in-context PVI is more efficient in that it requires only a few exemplars and does not require fine-tuning. We conducted a comprehensive empirical analysis to evaluate the reliability of in-context PVI. Our findings indicate that in-context PVI estimates exhibit similar characteristics to the original PVI. Specific to the in-context setting, we show that in-context PVI estimates remain consistent across different exemplar selections and numbers of shots. The variance of in-context PVI estimates across different exemplar selections is insignificant, which suggests that in-context PVI estimates are stable. Furthermore, we demonstrate how in-context PVI can be employed to identify challenging instances. Our work highlights the potential of in-context PVI and provides new insights into the capabilities of ICL.

Submitted 8 December, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

Comments: EMNLP 2023 Findings

arXiv:2308.06354

doi 10.1038/s41746-023-00970-0.

Large Language Models to Identify Social Determinants of Health in Electronic Health Records

Authors: Marco Guevara, Shan Chen, Spencer Thomas, Tafadzwa L. Chaunzwa, Idalid Franco, Benjamin Kann, Shalini Moningi, Jack Qian, Madeleine Goldstein, Susan Harper, Hugo JWL Aerts, Guergana K. Savova, Raymond H. Mak, Danielle S. Bitterman

Abstract: Social determinants of health (SDoH) have an important impact on patient outcomes but are incompletely collected from the electronic health records (EHR). This study researched the ability of large language models to extract SDoH from free text in EHRs, where they are most commonly documented, and explored the role of synthetic clinical text for improving the extraction of these scarcely documented, yet extremely valuable, clinical data. 800 patient notes were annotated for SDoH categories, and several transformer-based models were evaluated. The study also experimented with synthetic data generation and assessed for algorithmic bias. Our best-performing models were fine-tuned Flan-T5 XL (macro-F1 0.71) for any SDoH, and Flan-T5 XXL (macro-F1 0.70). The benefit of augmenting fine-tuning with synthetic data varied across model architecture and size, with smaller Flan-T5 models (base and large) showing the greatest improvements in performance (delta F1 +0.12 to +0.23). Model performance was similar on the in-hospital system dataset but worse on the MIMIC-III dataset. Our best-performing fine-tuned models outperformed zero- and few-shot performance of ChatGPT-family models for both tasks. These fine-tuned models were less likely than ChatGPT to change their prediction when race/ethnicity and gender descriptors were added to the text, suggesting less algorithmic bias (p<0.05). At the patient level, our models identified 93.8% of patients with adverse SDoH, while ICD-10 codes captured 2.0%.
Our method can effectively extract SDoH information from clinic notes, performing better compared to GPT zero- and few-shot settings. These models could enhance real-world evidence on SDoH and aid in identifying patients needing social support.

Submitted 5 March, 2024; v1 submitted 11 August, 2023; originally announced August 2023.

Comments: Peer-reviewed version published at NPJ Digital Medicine: https://www.nature.com/articles/s41746-023-00970-0

Journal ref: NPJ Digit Med. 2024 Jan 11;7(1):6

arXiv:2304.02496

doi 10.1093/jamia/ocad256

Evaluation of ChatGPT Family of Models for Biomedical Reasoning and Classification

Authors: Shan Chen, Yingya Li, Sheng Lu, Hoang Van, Hugo JWL Aerts, Guergana K. Savova, Danielle S. Bitterman

Abstract: Recent advances in large language models (LLMs) have shown impressive ability in biomedical question-answering, but have not been adequately investigated for more specific biomedical applications. This study investigates the performance of LLMs such as the ChatGPT family of models (GPT-3.5s, GPT-4) in biomedical tasks beyond question-answering. Because no patient data can be passed to the OpenAI API public interface, we evaluated model performance with over 10000 samples as proxies for two fundamental tasks in the clinical domain - classification and reasoning. The first task is classifying whether statements of clinical and policy recommendations in scientific literature constitute health advice. The second task is causal relation detection from the biomedical literature. We compared LLMs with simpler models, such as bag-of-words (BoW) with logistic regression, and fine-tuned BioBERT models. Despite the excitement around viral ChatGPT, we found that fine-tuning for two fundamental NLP tasks remained the best strategy. The simple BoW model performed on par with the most complex LLM prompting. Prompt engineering required significant investment.

Submitted 5 April, 2023; originally announced April 2023.

Comments: 28 pages, 2 tables and 4 figures. Submitting for review

arXiv:2303.13722

doi 10.1200/CCI.23.00048

Natural language processing to automatically extract the presence and severity of esophagitis in notes of patients undergoing radiotherapy

Authors: Shan Chen, Marco Guevara, Nicolas Ramirez, Arpi Murray, Jeremy L. Warner, Hugo JWL Aerts, Timothy A. Miller, Guergana K. Savova, Raymond H. Mak, Danielle S. Bitterman

Abstract: Radiotherapy (RT) toxicities can impair survival and quality-of-life, yet remain under-studied. Real-world evidence holds potential to improve our understanding of toxicities, but toxicity information is often only in clinical notes. We developed natural language processing (NLP) models to identify the presence and severity of esophagitis from notes of patients treated with thoracic RT. We fine-tuned statistical and pre-trained BERT-based models for three esophagitis classification tasks: Task 1) presence of esophagitis, Task 2) severe esophagitis or not, and Task 3) no esophagitis vs. grade 1 vs. grade 2-3. Transferability was tested on 345 notes from patients with esophageal cancer undergoing RT. Fine-tuning PubmedBERT yielded the best performance. The best macro-F1 was 0.92, 0.82, and 0.74 for Task 1, 2, and 3, respectively. Selecting the most informative note sections during fine-tuning improved macro-F1 by over 2% for all tasks. Silver-labeled data improved the macro-F1 by over 3% across all tasks. For the esophageal cancer notes, the best macro-F1 was 0.73, 0.74, and 0.65 for Task 1, 2, and 3, respectively, without additional fine-tuning. To our knowledge, this is the first effort to automatically extract esophagitis toxicity severity according to CTCAE guidelines from clinic notes. The promising performance provides proof-of-concept for NLP-based automated detailed toxicity monitoring in expanded domains.

Submitted 23 March, 2023; originally announced March 2023.

Comments: 17 pages, 6 tables, 1 figure, submitting to JCO-CCI for review

arXiv:2006.13737

Diagnosis Prevalence vs. Efficacy in Machine-learning Based Diagnostic Decision Support

Authors: Gil Alon, Elizabeth Chen, Guergana Savova, Carsten Eickhoff

Abstract: Many recent studies use machine learning to predict a small number of ICD-9-CM codes. In practice, on the other hand, physicians have to consider a broader range of diagnoses. This study aims to put these previously incongruent evaluation settings on a more equal footing by predicting ICD-9-CM codes based on electronic health record properties and demonstrating the relationship between diagnosis prevalence and system performance. We extracted patient features from the MIMIC-III dataset for each admission. We trained and evaluated 43 different machine learning classifiers. Among this pool, the most successful classifier was a Multi-Layer Perceptron. In accordance with general machine learning expectation, we observed all classifiers' F1 scores to drop as disease prevalence decreased. Scores fell from 0.28 for the 50 most prevalent ICD-9-CM codes to 0.03 for the 1000 most prevalent ICD-9-CM codes. Statistical analyses showed a moderate positive correlation between disease prevalence and efficacy (0.5866).

Submitted 24 June, 2020; originally announced June 2020.

Comments: AMIA Joint Summits in Translational Science, 2020

arXiv:2006.13721

Mining Misdiagnosis Patterns from Biomedical Literature

Authors: Cindy Li, Elizabeth Chen, Guergana Savova, Hamish Fraser, Carsten Eickhoff

Abstract: Diagnostic errors can pose a serious threat to patient safety, leading to serious harm and even death. Efforts are being made to develop interventions that allow physicians to reassess for errors and improve diagnostic accuracy. Our study presents an exploration of misdiagnosis patterns mined from PubMed abstracts. Article titles containing certain phrases indicating misdiagnosis were selected and frequencies of these misdiagnoses calculated. We present the resulting patterns in the form of a directed graph with frequency-weighted misdiagnosis edges connecting diagnosis vertices. We find that the most commonly misdiagnosed diseases were often misdiagnosed as many different diseases, with each misdiagnosis having a relatively low frequency, rather than as a single disease with greater probability. Additionally, while a misdiagnosis relationship may generally exist, the relationship was often found to be one-sided.

Submitted 24 June, 2020; originally announced June 2020.

Comments: AMIA Joint Summits in Translational Science, 2020

Journal ref: AMIA Jt Summits Transl Sci Proc. 2020;2020:360-366. Published 2020 May 30

arXiv:1912.12371

Open Source Software Sustainability Models: Initial White Paper from the Informatics Technology for Cancer Research Sustainability and Industry Partnership Work Group

Authors: Y. Ye, R. D. Boyce, M. K. Davis, K. Elliston, C. Davatzikos, A. Fedorov, J. C. Fillion-Robin, I. Foster, J. Gilbertson, M. Heiskanen, J. Klemm, A. Lasso, J. V. Miller, M. Morgan, S. Pieper, B. Raumann, B. Sarachan, G. Savova, J. C. Silverstein, D. Taylor, J. Zelnis, G. Q. Zhang, M. J. Becich

Abstract: The Sustainability and Industry Partnership Work Group (SIP-WG) is a part of the National Cancer Institute Informatics Technology for Cancer Research (ITCR) program. The charter of the SIP-WG is to investigate options of long-term sustainability of open source software (OSS) developed by the ITCR, in part by developing a collection of business model archetypes that can serve as sustainability plans for ITCR OSS development initiatives. The workgroup assembled models from the ITCR program, from other studies, and via engagement of its extensive network of relationships with other organizations (e.g., Chan Zuckerberg Initiative, Open Source Initiative and Software Sustainability Institute). This article reviews existing sustainability models and describes ten OSS use cases disseminated by the SIP-WG and others, and highlights five essential attributes (alignment with unmet scientific needs, dedicated development team, vibrant user community, feasible licensing model, and sustainable financial model) to assist academic software developers in achieving best practice in software sustainability.

Submitted 1 January, 2020; v1 submitted 27 December, 2019; originally announced December 2019.

Comments: 21-page main manuscript, 43-page supplemental file

Showing 1–14 of 14 results for author: Savova, G