-
Trustworthy Medical Question Answering: An Evaluation-Centric Survey
Authors:
Yinuo Wang,
Robert E. Mercer,
Frank Rudzicz,
Sudipta Singha Roy,
Pengjie Ren,
Zhumin Chen,
Xindi Wang
Abstract:
Trustworthiness in healthcare question-answering (QA) systems is important for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA…
▽ More
Trustworthiness in healthcare question-answering (QA) systems is important for ensuring patient safety, clinical effectiveness, and user confidence. As large language models (LLMs) become increasingly integrated into medical settings, the reliability of their responses directly influences clinical decision-making and patient outcomes. However, achieving comprehensive trustworthiness in medical QA poses significant challenges due to the inherent complexity of healthcare data, the critical nature of clinical scenarios, and the multifaceted dimensions of trustworthy AI. In this survey, we systematically examine six key dimensions of trustworthiness in medical QA, i.e., Factuality, Robustness, Fairness, Safety, Explainability, and Calibration. We review how each dimension is evaluated in existing LLM-based medical QA systems. We compile and compare major benchmarks designed to assess these dimensions and analyze evaluation-guided techniques that drive model improvements, such as retrieval-augmented grounding, adversarial fine-tuning, and safety alignment. Finally, we identify open challenges-such as scalable expert evaluation, integrated multi-dimensional metrics, and real-world deployment studies-and propose future research directions to advance the safe, reliable, and transparent deployment of LLM-powered medical QA.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification
Authors:
Sudipta Singha Roy,
Xindi Wang,
Robert E. Mercer,
Frank Rudzicz
Abstract:
Long document classification presents challenges in capturing both local and global dependencies due to their extensive content and complex structure. Existing methods often struggle with token limits and fail to adequately model hierarchical relationships within documents. To address these constraints, we propose a novel model leveraging a graph-tree structure. Our approach integrates syntax tree…
▽ More
Long document classification presents challenges in capturing both local and global dependencies due to their extensive content and complex structure. Existing methods often struggle with token limits and fail to adequately model hierarchical relationships within documents. To address these constraints, we propose a novel model leveraging a graph-tree structure. Our approach integrates syntax trees for sentence encodings and document graphs for document encodings, which capture fine-grained syntactic relationships and broader document contexts, respectively. We use Tree Transformers to generate sentence encodings, while a graph attention network models inter- and intra-sentence dependencies. During training, we implement bidirectional information propagation from word-to-sentence-to-document and vice versa, which enriches the contextual representation. Our proposed method enables a comprehensive understanding of content at all hierarchical levels and effectively handles arbitrarily long contexts without token limit constraints. Experimental results demonstrate the effectiveness of our approach in all types of long document classification tasks.
△ Less
Submitted 3 October, 2024;
originally announced October 2024.
-
Multi-stage Retrieve and Re-rank Model for Automatic Medical Coding Recommendation
Authors:
Xindi Wang,
Robert E. Mercer,
Frank Rudzicz
Abstract:
The International Classification of Diseases (ICD) serves as a definitive medical classification system encompassing a wide range of diseases and conditions. The primary objective of ICD indexing is to allocate a subset of ICD codes to a medical record, which facilitates standardized documentation and management of various health conditions. Most existing approaches have suffered from selecting th…
▽ More
The International Classification of Diseases (ICD) serves as a definitive medical classification system encompassing a wide range of diseases and conditions. The primary objective of ICD indexing is to allocate a subset of ICD codes to a medical record, which facilitates standardized documentation and management of various health conditions. Most existing approaches have suffered from selecting the proper label subsets from an extremely large ICD collection with a heavy long-tailed label distribution. In this paper, we leverage a multi-stage ``retrieve and re-rank'' framework as a novel solution to ICD indexing, via a hybrid discrete retrieval method, and re-rank retrieved candidates with contrastive learning that allows the model to make more accurate predictions from a simplified label space. The retrieval model is a hybrid of auxiliary knowledge of the electronic health records (EHR) and a discrete retrieval method (BM25), which efficiently collects high-quality candidates. In the last stage, we propose a label co-occurrence guided contrastive re-ranking model, which re-ranks the candidate labels by pulling together the clinical notes with positive ICD codes. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures on the MIMIC-III benchmark.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Auxiliary Knowledge-Induced Learning for Automatic Multi-Label Medical Document Classification
Authors:
Xindi Wang,
Robert E. Mercer,
Frank Rudzicz
Abstract:
The International Classification of Diseases (ICD) is an authoritative medical classification system of different diseases and conditions for clinical and management purposes. ICD indexing assigns a subset of ICD codes to a medical record. Since human coding is labour-intensive and error-prone, many studies employ machine learning to automate the coding process. ICD coding is a challenging task, a…
▽ More
The International Classification of Diseases (ICD) is an authoritative medical classification system of different diseases and conditions for clinical and management purposes. ICD indexing assigns a subset of ICD codes to a medical record. Since human coding is labour-intensive and error-prone, many studies employ machine learning to automate the coding process. ICD coding is a challenging task, as it needs to assign multiple codes to each medical document from an extremely large hierarchically organized collection. In this paper, we propose a novel approach for ICD indexing that adopts three ideas: (1) we use a multi-level deep dilated residual convolution encoder to aggregate the information from the clinical notes and learn document representations across different lengths of the texts; (2) we formalize the task of ICD classification with auxiliary knowledge of the medical records, which incorporates not only the clinical texts but also different clinical code terminologies and drug prescriptions for better inferring the ICD codes; and (3) we introduce a graph convolutional network to leverage the co-occurrence patterns among ICD codes, aiming to enhance the quality of label representations. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Investigating the Learning Behaviour of In-context Learning: A Comparison with Supervised Learning
Authors:
Xindi Wang,
Yufei Wang,
Can Xu,
Xiubo Geng,
Bowen Zhang,
Chongyang Tao,
Frank Rudzicz,
Robert E. Mercer,
Daxin Jiang
Abstract:
Large language models (LLMs) have shown remarkable capacity for in-context learning (ICL), where learning a new task from just a few training examples is done without being explicitly pre-trained. However, despite the success of LLMs, there has been little understanding of how ICL learns the knowledge from the given prompts. In this paper, to make progress toward understanding the learning behavio…
▽ More
Large language models (LLMs) have shown remarkable capacity for in-context learning (ICL), where learning a new task from just a few training examples is done without being explicitly pre-trained. However, despite the success of LLMs, there has been little understanding of how ICL learns the knowledge from the given prompts. In this paper, to make progress toward understanding the learning behaviour of ICL, we train the same LLMs with the same demonstration examples via ICL and supervised learning (SL), respectively, and investigate their performance under label perturbations (i.e., noisy labels and label imbalance) on a range of classification tasks. First, via extensive experiments, we find that gold labels have significant impacts on the downstream in-context performance, especially for large language models; however, imbalanced labels matter little to ICL across all model sizes. Second, when comparing with SL, we show empirically that ICL is less sensitive to label perturbations than SL, and ICL gradually attains comparable performance to SL as the model size increases.
△ Less
Submitted 1 August, 2023; v1 submitted 28 July, 2023;
originally announced July 2023.
-
MeSHup: A Corpus for Full Text Biomedical Document Indexing
Authors:
Xindi Wang,
Robert E. Mercer,
Frank Rudzicz
Abstract:
Medical Subject Heading (MeSH) indexing refers to the problem of assigning a given biomedical document with the most relevant labels from an extremely large set of MeSH terms. Currently, the vast number of biomedical articles in the PubMed database are manually annotated by human curators, which is time consuming and costly; therefore, a computational system that can assist the indexing is highly…
▽ More
Medical Subject Heading (MeSH) indexing refers to the problem of assigning a given biomedical document with the most relevant labels from an extremely large set of MeSH terms. Currently, the vast number of biomedical articles in the PubMed database are manually annotated by human curators, which is time consuming and costly; therefore, a computational system that can assist the indexing is highly valuable. When developing supervised MeSH indexing systems, the availability of a large-scale annotated text corpus is desirable. A publicly available, large corpus that permits robust evaluation and comparison of various systems is important to the research community. We release a large scale annotated MeSH indexing corpus, MeSHup, which contains 1,342,667 full text articles in English, together with the associated MeSH labels and metadata, authors, and publication venues that are collected from the MEDLINE database. We train an end-to-end model that combines features from documents and their associated labels on our corpus and report the new baseline.
△ Less
Submitted 28 April, 2022;
originally announced April 2022.
-
Evaluation Benchmarks for Spanish Sentence Representations
Authors:
Vladimir Araujo,
Andrés Carvallo,
Souvik Kundu,
José Cañete,
Marcelo Mendoza,
Robert E. Mercer,
Felipe Bravo-Marquez,
Marie-Francine Moens,
Alvaro Soto
Abstract:
Due to the success of pre-trained language models, versions of languages other than English have been released in recent years. This fact implies the need for resources to evaluate these models. In the case of Spanish, there are few ways to systematically assess the models' quality. In this paper, we narrow the gap by building two evaluation benchmarks. Inspired by previous work (Conneau and Kiela…
▽ More
Due to the success of pre-trained language models, versions of languages other than English have been released in recent years. This fact implies the need for resources to evaluate these models. In the case of Spanish, there are few ways to systematically assess the models' quality. In this paper, we narrow the gap by building two evaluation benchmarks. Inspired by previous work (Conneau and Kiela, 2018; Chen et al., 2019), we introduce Spanish SentEval and Spanish DiscoEval, aiming to assess the capabilities of stand-alone and discourse-aware sentence representations, respectively. Our benchmarks include considerable pre-existing and newly constructed datasets that address different tasks from various domains. In addition, we evaluate and analyze the most recent pre-trained Spanish language models to exhibit their capabilities and limitations. As an example, we discover that for the case of discourse evaluation tasks, mBERT, a language model trained on multiple languages, usually provides a richer latent representation than models trained only with documents in Spanish. We hope our contribution will motivate a fairer, more comparable, and less cumbersome way to evaluate future Spanish language models.
△ Less
Submitted 15 April, 2022;
originally announced April 2022.
-
KenMeSH: Knowledge-enhanced End-to-end Biomedical Text Labelling
Authors:
Xindi Wang,
Robert E. Mercer,
Frank Rudzicz
Abstract:
Currently, Medical Subject Headings (MeSH) are manually assigned to every biomedical article published and subsequently recorded in the PubMed database to facilitate retrieving relevant information. With the rapid growth of the PubMed database, large-scale biomedical document indexing becomes increasingly important. MeSH indexing is a challenging task for machine learning, as it needs to assign mu…
▽ More
Currently, Medical Subject Headings (MeSH) are manually assigned to every biomedical article published and subsequently recorded in the PubMed database to facilitate retrieving relevant information. With the rapid growth of the PubMed database, large-scale biomedical document indexing becomes increasingly important. MeSH indexing is a challenging task for machine learning, as it needs to assign multiple labels to each article from an extremely large hierachically organized collection. To address this challenge, we propose KenMeSH, an end-to-end model that combines new text features and a dynamic \textbf{K}nowledge-\textbf{en}hanced mask attention that integrates document features with MeSH label hierarchy and journal correlation features to index MeSH terms. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures.
△ Less
Submitted 13 March, 2022;
originally announced March 2022.
-
Improving Tree-LSTM with Tree Attention
Authors:
Mahtab Ahmed,
Muhammad Rifayat Samee,
Robert E. Mercer
Abstract:
In Natural Language Processing (NLP), we often need to extract information from tree topology. Sentence structure can be represented via a dependency tree or a constituency tree structure. For this reason, a variant of LSTMs, named Tree-LSTM, was proposed to work on tree topology. In this paper, we design a generalized attention framework for both dependency and constituency trees by encoding vari…
▽ More
In Natural Language Processing (NLP), we often need to extract information from tree topology. Sentence structure can be represented via a dependency tree or a constituency tree structure. For this reason, a variant of LSTMs, named Tree-LSTM, was proposed to work on tree topology. In this paper, we design a generalized attention framework for both dependency and constituency trees by encoding variants of decomposable attention inside a Tree-LSTM cell. We evaluated our models on a semantic relatedness task and achieved notable results compared to Tree-LSTM based methods with no attention as well as other neural and non-neural methods and good results compared to Tree-LSTM based methods with attention.
△ Less
Submitted 31 December, 2018;
originally announced January 2019.
-
A Novel Neural Sequence Model with Multiple Attentions for Word Sense Disambiguation
Authors:
Mahtab Ahmed,
Muhammad Rifayat Samee,
Robert E. Mercer
Abstract:
Word sense disambiguation (WSD) is a well researched problem in computational linguistics. Different research works have approached this problem in different ways. Some state of the art results that have been achieved for this problem are by supervised models in terms of accuracy, but they often fall behind flexible knowledge-based solutions which use engineered features as well as human annotator…
▽ More
Word sense disambiguation (WSD) is a well researched problem in computational linguistics. Different research works have approached this problem in different ways. Some state of the art results that have been achieved for this problem are by supervised models in terms of accuracy, but they often fall behind flexible knowledge-based solutions which use engineered features as well as human annotators to disambiguate every target word. This work focuses on bridging this gap using neural sequence models incorporating the well-known attention mechanism. The main gist of our work is to combine multiple attentions on different linguistic features through weights and to provide a unified framework for doing this. This weighted attention allows the model to easily disambiguate the sense of an ambiguous word by attending over a suitable portion of a sentence. Our extensive experiments show that multiple attention enables a more versatile encoder-decoder model leading to state of the art results.
△ Less
Submitted 4 September, 2018;
originally announced September 2018.
-
Identifying Protein-Protein Interaction using Tree LSTM and Structured Attention
Authors:
Mahtab Ahmed,
Jumayel Islam,
Muhammad Rifayat Samee,
Robert E. Mercer
Abstract:
Identifying interactions between proteins is important to understand underlying biological processes. Extracting a protein-protein interaction (PPI) from the raw text is often very difficult. Previous supervised learning methods have used handcrafted features on human-annotated data sets. In this paper, we propose a novel tree recurrent neural network with structured attention architecture for doi…
▽ More
Identifying interactions between proteins is important to understand underlying biological processes. Extracting a protein-protein interaction (PPI) from the raw text is often very difficult. Previous supervised learning methods have used handcrafted features on human-annotated data sets. In this paper, we propose a novel tree recurrent neural network with structured attention architecture for doing PPI. Our architecture achieves state of the art results (precision, recall, and F1-score) on the AIMed and BioInfer benchmark data sets. Moreover, our models achieve a significant improvement over previous best models without any explicit feature extraction. Our experimental results show that traditional recurrent networks have inferior performance compared to tree recurrent networks for the supervised PPI problem.
△ Less
Submitted 27 July, 2018;
originally announced August 2018.
-
Improving Neural Sequence Labelling using Additional Linguistic Information
Authors:
Mahtab Ahmed,
Muhammad Rifayat Samee,
Robert E. Mercer
Abstract:
Sequence labelling is the task of assigning categorical labels to a data sequence. In Natural Language Processing, sequence labelling can be applied to various fundamental problems, such as Part of Speech (POS) tagging, Named Entity Recognition (NER), and Chunking. In this study, we propose a method to add various linguistic features to the neural sequence framework to improve sequence labelling.…
▽ More
Sequence labelling is the task of assigning categorical labels to a data sequence. In Natural Language Processing, sequence labelling can be applied to various fundamental problems, such as Part of Speech (POS) tagging, Named Entity Recognition (NER), and Chunking. In this study, we propose a method to add various linguistic features to the neural sequence framework to improve sequence labelling. Besides word level knowledge, sense embeddings are added to provide semantic information. Additionally, selective readings of character embeddings are added to capture contextual as well as morphological features for each word in a sentence. Compared to previous methods, these added linguistic features allow us to design a more concise model and perform more efficient training. Our proposed architecture achieves state of the art results on the benchmark datasets of POS, NER, and chunking. Moreover, the convergence rate of our model is significantly better than the previous state of the art models.
△ Less
Submitted 27 July, 2018;
originally announced July 2018.
-
Extracting Connected Concepts from Biomedical Texts using Fog Index
Authors:
Rushdi Shams,
Robert E. Mercer
Abstract:
In this paper, we establish Fog Index (FI) as a text filter to locate the sentences in texts that contain connected biomedical concepts of interest. To do so, we have used 24 random papers each containing four pairs of connected concepts. For each pair, we categorize sentences based on whether they contain both, any or none of the concepts. We then use FI to measure difficulty of the sentences of…
▽ More
In this paper, we establish Fog Index (FI) as a text filter to locate the sentences in texts that contain connected biomedical concepts of interest. To do so, we have used 24 random papers each containing four pairs of connected concepts. For each pair, we categorize sentences based on whether they contain both, any or none of the concepts. We then use FI to measure difficulty of the sentences of each category and find that sentences containing both of the concepts have low readability. We rank sentences of a text according to their FI and select 30 percent of the most difficult sentences. We use an association matrix to track the most frequent pairs of concepts in them. This matrix reports that the first filter produces some pairs that hold almost no connections. To remove these unwanted pairs, we use the Equally Weighted Harmonic Mean of their Positive Predictive Value (PPV) and Sensitivity as a second filter. Experimental results demonstrate the effectiveness of our method.
△ Less
Submitted 30 July, 2013;
originally announced July 2013.
-
Investigating Keyphrase Indexing with Text Denoising
Authors:
Rushdi Shams,
Robert E. Mercer
Abstract:
In this paper, we report on indexing performance by a state-of-the-art keyphrase indexer, Maui, when paired with a text extraction procedure called text denoising. Text denoising is a method that extracts the denoised text, comprising the content-rich sentences, from full texts. The performance of the keyphrase indexer is demonstrated on three standard corpora collected from three domains, namely…
▽ More
In this paper, we report on indexing performance by a state-of-the-art keyphrase indexer, Maui, when paired with a text extraction procedure called text denoising. Text denoising is a method that extracts the denoised text, comprising the content-rich sentences, from full texts. The performance of the keyphrase indexer is demonstrated on three standard corpora collected from three domains, namely food and agriculture, high energy physics, and biomedical science. Maui is trained using the full texts and denoised texts. The indexer, using its trained models, then extracts keyphrases from test sets comprising full texts, and their denoised and noise parts (i.e., the part of texts that remains after denoising). Experimental findings show that against a gold standard, the denoised-text-trained indexer indexing full texts, performs either better than or as good as its benchmark performance produced by a full-text-trained indexer indexing full texts.
△ Less
Submitted 10 April, 2012;
originally announced April 2012.
-
Optimal Belief Revision
Authors:
Carmen Vodislav,
Robert E. Mercer
Abstract:
We propose a new approach to belief revision that provides a way to change knowledge bases with a minimum of effort. We call this way of revising belief states optimal belief revision. Our revision method gives special attention to the fact that most belief revision processes are directed to a specific informational objective. This approach to belief change is founded on notions such as optimal…
▽ More
We propose a new approach to belief revision that provides a way to change knowledge bases with a minimum of effort. We call this way of revising belief states optimal belief revision. Our revision method gives special attention to the fact that most belief revision processes are directed to a specific informational objective. This approach to belief change is founded on notions such as optimal context and accessibility. For the sentential model of belief states we provide both a formal description of contexts as sub-theories determined by three parameters and a method to construct contexts. Next, we introduce an accessibility ordering for belief sets, which we then use for selecting the best (optimal) contexts with respect to the processing effort involved in the revision. Then, for finitely axiomatizable knowledge bases, we characterize a finite accessibility ranking from which the accessibility ordering for the entire base is generated and show how to determine the ranking of an arbitrary sentence in the language. Finally, we define the adjustment of the accessibility ranking of a revised base of a belief set.
△ Less
Submitted 8 March, 2000;
originally announced March 2000.