-
Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine
Authors:
Sebastian Joseph,
Lily Chen,
Barry Wei,
Michael Mackert,
Iain J. Marshall,
Paul Pu Liang,
Ramez Kouzy,
Byron C. Wallace,
Junyi Jessy Li
Abstract:
Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet the nature of it is highly technical, rendering the medical literacy of the majority of users inadequate to sufficiently navigate the domain. Such problems with medical communication ripen the ground for end-to-end fact-checking agents: check a claim against current medical literature and return with an evidence-backed verdict. And yet, such systems remain largely unused. To understand this, we present the first study examining how clinical experts verify real claims from social media by synthesizing medical evidence. In searching for this upper bound, we reveal fundamental challenges in end-to-end fact-checking when applied to medicine: difficulties connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguities in underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. We argue that fact-checking should be approached and evaluated as an interactive communication problem, rather than an end-to-end process.
Submitted 25 June, 2025;
originally announced June 2025.
-
The Dual-Route Model of Induction
Authors:
Sheridan Feucht,
Eric Todd,
Byron Wallace,
David Bau
Abstract:
Prior work on in-context copying has shown the existence of induction heads, which attend to and promote individual tokens during copying. In this work we introduce a new type of induction head: concept-level induction heads, which copy entire lexical units instead of individual tokens. Concept induction heads learn to attend to the ends of multi-token words throughout training, working in parallel with token-level induction heads to copy meaningful text. We show that these heads are responsible for semantic tasks like word-level translation, whereas token induction heads are vital for tasks that can only be done verbatim, like copying nonsense tokens. These two "routes" operate independently: in fact, we show that ablation of token induction heads causes models to paraphrase where they would otherwise copy verbatim. In light of these findings, we argue that although token induction heads are vital for specific tasks, concept induction heads may be more broadly relevant for in-context learning.
Submitted 3 April, 2025;
originally announced April 2025.
-
Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
Authors:
Hiba Ahsan,
Arnab Sen Sharma,
Silvio Amir,
David Bau,
Byron C. Wallace
Abstract:
We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gender information is highly localized in middle MLP layers and can be reliably manipulated at inference time via patching. Such interventions can surgically alter generated clinical vignettes for specific conditions, and also influence downstream clinical predictions which correlate with gender, e.g., patient risk of depression. We find that representation of patient race is somewhat more distributed, but can also be intervened upon, to a degree. To our knowledge, this is the first application of mechanistic interpretability methods to LLMs for healthcare.
Submitted 18 February, 2025;
originally announced February 2025.
-
Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?
Authors:
Hye Sun Yun,
Karen Y. C. Zhang,
Ramez Kouzy,
Iain J. Marshall,
Junyi Jessy Li,
Byron C. Wallace
Abstract:
Medical research faces well-documented challenges in translating novel treatments into clinical practice. Publishing incentives encourage researchers to present "positive" findings, even when empirical results are equivocal. Consequently, it is well-documented that authors often spin study results, especially in article abstracts. Such spin can influence clinician interpretation of evidence and may affect patient care decisions. In this study, we ask whether the interpretation of trial results offered by Large Language Models (LLMs) is similarly affected by spin. This is important since LLMs are increasingly being used to trawl through and synthesize published medical evidence. We evaluated 22 LLMs and found that they are across the board more susceptible to spin than humans. They might also propagate spin into their outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into plain language summaries that they generate. We also find, however, that LLMs are generally capable of recognizing spin, and can be prompted in a way to mitigate spin's impact on LLM outputs.
Submitted 5 May, 2025; v1 submitted 11 February, 2025;
originally announced February 2025.
-
Who Taught You That? Tracing Teachers in Model Distillation
Authors:
Somin Wadhwa,
Chantal Shaib,
Silvio Amir,
Byron C. Wallace
Abstract:
Model distillation -- using outputs from a large teacher model to teach a small student model -- is a practical means of creating efficient models for a particular task. We ask: Can we identify a student's teacher based on its outputs? Such "footprints" left by teacher LLMs would be interesting artifacts. Beyond this, reliable teacher inference may have practical implications as actors seek to distill specific capabilities of massive proprietary LLMs into deployed smaller LMs, potentially violating terms of service. We consider practical task distillation targets including summarization, question answering, and instruction-following. We assume a finite set of candidate teacher models, which we treat as black boxes. We design discriminative models that operate over lexical features. We find that $n$-gram similarity alone is unreliable for identifying teachers, but part-of-speech (PoS) templates preferred by student models mimic those of their teachers.
Submitted 20 May, 2025; v1 submitted 10 February, 2025;
originally announced February 2025.
-
Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation
Authors:
Sanjana Ramprasad,
Byron C. Wallace
Abstract:
Modern LLMs can now produce highly readable abstractive summaries, to the point where traditional automated metrics for evaluating summary quality, such as ROUGE, have become saturated. However, LLMs still sometimes introduce unwanted content into summaries, i.e., information inconsistent with or unsupported by their source. Measuring the occurrence of these often subtle "hallucinations" automatically has proved to be challenging. This in turn has motivated development of a variety of metrics intended to measure the factual consistency of generated summaries against their source. But are these approaches measuring what they purport to do? In this work, we stress-test automatic factuality metrics. Specifically, we investigate whether and to what degree superficial attributes of summary texts suffice to predict "factuality", finding that a (supervised) model using only such shallow features is reasonably competitive with SOTA factuality scoring methods. We then evaluate how factuality metrics respond to factual corrections in inconsistent summaries and find that only a few show meaningful improvements. In contrast, some metrics are more sensitive to benign, non-factual edits. Motivated by these insights, we show that one can "game" (most) automatic factuality metrics, i.e., reliably inflate "factuality" scores by appending innocuous sentences to generated summaries. Taken together, our results raise questions about the degree to which we should rely on existing automated factuality metrics and what exactly we want "factuality metrics" to measure.
Submitted 28 November, 2024; v1 submitted 25 November, 2024;
originally announced November 2024.
-
IPMN Risk Assessment under Federated Learning Paradigm
Authors:
Hongyi Pan,
Ziliang Hong,
Gorkem Durak,
Elif Keles,
Halil Ertugrul Aktas,
Yavuz Taktak,
Alpay Medetalibeyoglu,
Zheyuan Zhang,
Yury Velichko,
Concetto Spampinato,
Ivo Schoots,
Marco J. Bruno,
Pallavi Tiwari,
Candice Bolan,
Tamas Gonda,
Frank Miller,
Rajesh N. Keswani,
Michael B. Wallace,
Ziyue Xu,
Ulas Bagci
Abstract:
Accurate classification of Intraductal Papillary Mucinous Neoplasms (IPMN) is essential for identifying high-risk cases that require timely intervention. In this study, we develop a federated learning framework for multi-center IPMN classification utilizing a comprehensive pancreas MRI dataset. This dataset includes 652 T1-weighted and 655 T2-weighted MRI images, accompanied by corresponding IPMN risk scores from 7 leading medical institutions, making it the largest and most diverse dataset for IPMN classification to date. We assess the performance of DenseNet-121 in both centralized and federated settings for training on distributed data. Our results demonstrate that the federated learning approach achieves high classification accuracy comparable to centralized learning while ensuring data privacy across institutions. This work marks a significant advancement in collaborative IPMN classification, facilitating secure and high-accuracy model training across multiple centers.
Submitted 22 January, 2025; v1 submitted 8 November, 2024;
originally announced November 2024.
-
Adaptive Aggregation Weights for Federated Segmentation of Pancreas MRI
Authors:
Hongyi Pan,
Gorkem Durak,
Zheyuan Zhang,
Yavuz Taktak,
Elif Keles,
Halil Ertugrul Aktas,
Alpay Medetalibeyoglu,
Yury Velichko,
Concetto Spampinato,
Ivo Schoots,
Marco J. Bruno,
Rajesh N. Keswani,
Pallavi Tiwari,
Candice Bolan,
Tamas Gonda,
Michael G. Goggins,
Michael B. Wallace,
Ziyue Xu,
Ulas Bagci
Abstract:
Federated learning (FL) enables collaborative model training across institutions without sharing sensitive data, making it an attractive solution for medical imaging tasks. However, traditional FL methods, such as Federated Averaging (FedAvg), face difficulties in generalizing across domains due to variations in imaging protocols and patient demographics across institutions. This challenge is particularly evident in pancreas MRI segmentation, where anatomical variability and imaging artifacts significantly impact performance. In this paper, we conduct a comprehensive evaluation of FL algorithms for pancreas MRI segmentation and introduce a novel approach that incorporates adaptive aggregation weights. By dynamically adjusting the contribution of each client during model aggregation, our method accounts for domain-specific differences and improves generalization across heterogeneous datasets. Experimental results demonstrate that our approach enhances segmentation accuracy and reduces the impact of domain shift compared to conventional FL methods while maintaining privacy-preserving capabilities. Significant performance improvements are observed across multiple hospitals (centers).
Submitted 6 May, 2025; v1 submitted 29 October, 2024;
originally announced October 2024.
-
Characterising the z $\sim$ 7.66 Type-II AGN candidate SMACS S06355 using BEAGLE-AGN and JWST NIRSpec/NIRCam
Authors:
M. S. Silcock,
E. Curtis-Lake,
D. J. B. Smith,
I. E. B. Wallace,
A. Vidal-García,
A. Plat,
M. Hirschmann,
A. Feltre,
J. Chevallard,
S. Charlot,
S. Carniani,
A. J. Bunker
Abstract:
The presence of Active Galactic Nuclei (AGN) in low mass (Mstar $\lesssim$ $10^{9}$ Msun) galaxies at high redshift has been established, and it is important to characterise these objects and the impact of their feedback on the host galaxies. In this paper we apply the Spectral Energy Distribution (SED) fitting code BEAGLE-AGN to SMACS S06355, a z $\sim$ 7.66 Type-II AGN candidate from the JWST NIRSpec Early Release Observations. This object's spectrum includes a detection of the [NeIV]2426 line, indicating an obscured AGN due to its high ionization potential energy ($\sim$ 63eV). We use BEAGLE-AGN to simultaneously model the Narrow Line Region (NLR) AGN and star-forming galaxy contributions to the observed line fluxes and photometry. Having a high-ionization emission line allows the contribution of the NLR to the remaining lines to be probabilistically disentangled. The HII region metallicity is derived to be 12+log(O/H)$^{\mathrm{HII}}$ = $7.82^{+0.18}_{-0.19}$. Assuming that the Neon-to-Oxygen abundance is similar to solar we derive a high NLR metallicity of 12+log(O/H)$^\mathrm{NLR}$ = $8.86^{+0.14}_{-0.16}$, with the 2$σ$ lower-limit extending to 12+log(O/H)$^{\mathrm{NLR}}$ $\sim$ 8.54, showing the derivation is uncertain. We discuss this result with respect to non-solar Neon abundances that might boost the inferred NLR metallicity. The NLR metallicity places SMACS S06355 in a comparable region of the mass-metallicity plane to intermediate (1.5 $\lesssim$ z $\lesssim$ 3.0) redshift obscured AGN. Our derived accretion disc luminosity, log($L_{acc}$ / erg $s^{-1}$) = $45.19^{+0.12}_{-0.11}$, is moderately high yet still uncertain. We highlight that deviations between bolometric luminosity calibrations and model grid tracks become enhanced at low metallicities.
Submitted 27 June, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
Decentralized Uncertainty-Aware Active Search with a Team of Aerial Robots
Authors:
Wennie Tabib,
John Stecklein,
Caleb McDowell,
Kshitij Goel,
Felix Jonathan,
Abhishek Rathod,
Meghan Kokoski,
Edsel Burkholder,
Brian Wallace,
Luis Ernesto Navarro-Serment,
Nikhil Angad Bakshi,
Tejus Gupta,
Norman Papernick,
David Guttendorf,
Erik E. Kahn,
Jessica Kasemer,
Jesse Holdaway,
Jeff Schneider
Abstract:
Rapid search and rescue is critical to maximizing survival rates following natural disasters. However, these efforts are challenged by the need to search large disaster zones, lack of reliability in the communications infrastructure, and a priori unknown numbers of objects of interest (OOIs), such as injured survivors. Aerial robots are increasingly being deployed for search and rescue due to their high mobility, but there remains a gap in deploying multi-robot autonomous aerial systems for methodical search of large environments. Prior works have relied on preprogrammed paths from human operators or are evaluated only in simulation. We bridge these gaps in the state of the art by developing and demonstrating a decentralized active search system, which biases its trajectories to take additional views of uncertain OOIs. The methodology leverages stochasticity for rapid coverage in communication denied scenarios. When communications are available, robots share poses, goals, and OOI information to accelerate the rate of search. Detections from multiple images and vehicles are fused to provide a mean and covariance for each OOI location. Extensive simulations and hardware experiments in Bloomingdale, OH, are conducted to validate the approach. The results demonstrate the active search approach outperforms greedy coverage-based planning in communication-denied scenarios while maintaining comparable performance in communication-enabled scenarios. The results also demonstrate the ability to detect and localize all a priori unknown OOIs with a mean error of approximately 3m at flight altitudes between 50m-60m.
Submitted 10 June, 2025; v1 submitted 11 October, 2024;
originally announced October 2024.
-
Optimizing Synthetic Data for Enhanced Pancreatic Tumor Segmentation
Authors:
Linkai Peng,
Zheyuan Zhang,
Gorkem Durak,
Frank H. Miller,
Alpay Medetalibeyoglu,
Michael B. Wallace,
Ulas Bagci
Abstract:
Pancreatic cancer remains one of the leading causes of cancer-related mortality worldwide. Precise segmentation of pancreatic tumors from medical images is a bottleneck for effective clinical decision-making. However, achieving high accuracy is often limited by the small size and availability of real patient data for training deep learning models. Recent approaches have employed synthetic data generation to augment training datasets. While promising, these methods may not yet meet the performance benchmarks required for real-world clinical use. This study critically evaluates the limitations of existing generative-AI-based frameworks for pancreatic tumor segmentation. We conduct a series of experiments to investigate the impact of synthetic tumor size and boundary definition precision on model performance. Our findings demonstrate that: (1) strategically selecting a combination of synthetic tumor sizes is crucial for optimal segmentation outcomes, and (2) generating synthetic tumors with precise boundaries significantly improves model accuracy. These insights highlight the importance of utilizing refined synthetic data augmentation for enhancing the clinical utility of segmentation models in pancreatic cancer decision making including diagnosis, prognosis, and treatment plans. Our code will be available at https://github.com/lkpengcs/SynTumorAnalyzer.
Submitted 1 October, 2024; v1 submitted 27 July, 2024;
originally announced July 2024.
-
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
Authors:
Jaden Fiotto-Kaufman,
Alexander R. Loftus,
Eric Todd,
Jannik Brinkmann,
Koyena Pal,
Dmitrii Troitskii,
Michael Ripa,
Adam Belfki,
Can Rager,
Caden Juang,
Aaron Mueller,
Samuel Marks,
Arnab Sen Sharma,
Francesca Lucchetti,
Nikhil Prakash,
Carla Brodley,
Arjun Guha,
Jonathan Bell,
Byron C. Wallace,
David Bau
Abstract:
We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred remote execution. The National Deep Inference Fabric (NDIF) is a scalable inference service that executes NNsight requests, allowing users to share GPU resources and pretrained models. These technologies are enabled by the Intervention Graph, an architecture developed to decouple experimental design from model runtime. Together, this framework provides transparent and efficient access to the internals of deep neural networks such as very large language models (LLMs) without imposing the cost or complexity of hosting customized models individually. We conduct a quantitative survey of the machine learning literature that reveals a growing gap in the study of the internals of large-scale AI. We demonstrate the design and use of our framework to address this gap by enabling a range of research methods on huge models. Finally, we conduct benchmarks to compare performance with previous approaches.
Code, documentation, and tutorials are available at https://nnsight.net/.
Submitted 1 April, 2025; v1 submitted 18 July, 2024;
originally announced July 2024.
-
Open (Clinical) LLMs are Sensitive to Instruction Phrasings
Authors:
Alberto Mario Ceballos Arroyo,
Monica Munnangi,
Jiuding Sun,
Karen Y. C. Zhang,
Denis Jered McInerney,
Byron C. Wallace,
Silvio Amir
Abstract:
Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain.
This raises a practical question: How robust are instruction-tuned LLMs to natural variations in the instructions provided for clinical NLP tasks? We collect prompts from medical doctors across a range of tasks and quantify the sensitivity of seven LLMs -- some general, others specialized -- to natural (i.e., non-adversarial) instruction phrasings. We find that performance varies substantially across all models, and that -- perhaps surprisingly -- domain-specific models explicitly trained on clinical data are especially brittle, compared to their general domain counterparts. Further, arbitrary phrasing differences can affect fairness, e.g., valid but distinct instructions for mortality prediction yield a range both in overall performance, and in terms of differences between demographic groups.
Submitted 12 July, 2024;
originally announced July 2024.
-
Detection and Measurement of Syntactic Templates in Generated Text
Authors:
Chantal Shaib,
Yanai Elazar,
Junyi Jessy Li,
Byron C. Wallace
Abstract:
Recent work on evaluating the diversity of text generated by LLMs has focused on word-level features. Here we offer an analysis of syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts. We find that most (76%) templates in model-generated text can be found in pre-training data (compared to only 35% of human-authored text), and are not overwritten during fine-tuning processes such as RLHF. This connection to the pre-training data allows us to analyze syntactic templates in models where we do not have the pre-training data. We also find that templates as features are able to differentiate between models, tasks, and domains, and are useful for qualitatively evaluating common model constructions. Finally, we demonstrate the use of templates as a useful tool for analyzing style memorization of training data in LLMs.
Submitted 6 October, 2024; v1 submitted 28 June, 2024;
originally announced July 2024.
-
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
Authors:
Sheridan Feucht,
David Atkinson,
Byron Wallace,
David Bau
Abstract:
LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east." Similarly, the overall meanings of named entities like "Neil Young" and multi-word expressions like "break a leg" cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to "read out" the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
Submitted 11 October, 2024; v1 submitted 28 June, 2024;
originally announced June 2024.
-
Investigating Mysteries of CoT-Augmented Distillation
Authors:
Somin Wadhwa,
Silvio Amir,
Byron C. Wallace
Abstract:
Eliciting "chain of thought" (CoT) rationales -- sequences of tokens that convey a "reasoning" process -- has been shown to consistently improve LLM performance on tasks like question answering. More recent efforts have shown that such rationales can also be used for model distillation: Including CoT sequences (elicited from a large "teacher" model) in addition to target labels when fine-tuning a small student model yields (often substantial) improvements. In this work we ask: Why and how does this additional training signal help in model distillation? We perform ablations to interrogate this, and report some potentially surprising results. Specifically: (1) Placing CoT sequences after labels (rather than before) realizes consistently better downstream performance -- this means that no student "reasoning" is necessary at test time to realize gains. (2) When rationales are appended in this way, they need not be coherent reasoning sequences to yield improvements; performance increases are robust to permutations of CoT tokens, for example. In fact, (3) a small number of key tokens are sufficient to achieve improvements equivalent to those observed when full rationales are used in model distillation.
Submitted 27 September, 2024; v1 submitted 20 June, 2024;
originally announced June 2024.
-
Learning from Natural Language Explanations for Generalizable Entity Matching
Authors:
Somin Wadhwa,
Adit Krishnan,
Runhui Wang,
Byron C. Wallace,
Chris Kong
Abstract:
Entity matching is the task of linking records from different sources that refer to the same real-world entity. Past work has primarily treated entity linking as a standard supervised learning problem. However, supervised entity matching models often do not generalize well to new data, and collecting exhaustive labeled training data is often cost prohibitive. Further, recent efforts have adopted LLMs for this task in few/zero-shot settings, exploiting their general knowledge. But LLMs are prohibitively expensive for performing inference at scale for real-world entity matching tasks.
As an efficient alternative, we re-cast entity matching as a conditional generation task as opposed to binary classification. This enables us to "distill" LLM reasoning into smaller entity matching models via natural language explanations. This approach achieves strong performance, especially on out-of-domain generalization tests (10.85% F-1) where standalone generative methods struggle. We perform ablations that highlight the importance of explanations, both for performance and model robustness.
Submitted 27 September, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning
Authors:
Zheyuan Zhang,
Elif Keles,
Gorkem Durak,
Yavuz Taktak,
Onkar Susladkar,
Vandan Gorade,
Debesh Jha,
Asli C. Ormeci,
Alpay Medetalibeyoglu,
Lanhong Yao,
Bin Wang,
Ilkin Sevgi Isler,
Linkai Peng,
Hongyi Pan,
Camila Lopes Vendrami,
Amir Bourhani,
Yury Velichko,
Boqing Gong,
Concetto Spampinato,
Ayis Pyrros,
Pallavi Tiwari,
Derk C. F. Klatte,
Megan Engels,
Sanne Hoogenboom,
Candice W. Bolan
, et al. (13 additional authors not shown)
Abstract:
Automated volumetric segmentation of the pancreas on cross-sectional imaging is needed for diagnosis and follow-up of pancreatic diseases. While CT-based pancreatic segmentation is more established, MRI-based segmentation methods are understudied, largely due to a lack of publicly available datasets, benchmarking research efforts, and domain-specific deep learning methods. In this retrospective study, we collected a large dataset (767 scans from 499 participants) of T1-weighted (T1W) and T2-weighted (T2W) abdominal MRI series from five centers between March 2004 and November 2022. We also collected CT scans of 1,350 patients from publicly available sources for benchmarking purposes. We developed a new pancreas segmentation method, called PanSegNet, combining the strengths of nnUNet and a Transformer network with a new linear attention module enabling volumetric computation. We tested PanSegNet's accuracy in cross-modality (a total of 2,117 scans) and cross-center settings with Dice and Hausdorff distance (HD95) evaluation metrics. We used Cohen's kappa statistics for intra and inter-rater agreement evaluation and paired t-tests for volume and Dice comparisons, respectively. For segmentation accuracy, we achieved Dice coefficients of 88.3% (std: 7.2%, at case level) with CT, 85.0% (std: 7.9%) with T1W MRI, and 86.3% (std: 6.4%) with T2W MRI. There was a high correlation for pancreas volume prediction with R^2 of 0.91, 0.84, and 0.85 for CT, T1W, and T2W, respectively. We found moderate inter-observer (0.624 and 0.638 for T1W and T2W MRI, respectively) and high intra-observer agreement scores. All MRI data is made available at https://osf.io/kysnj/. Our source code is available at https://github.com/NUBagciLab/PaNSegNet.
Submitted 24 October, 2024; v1 submitted 20 May, 2024;
originally announced May 2024.
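The Dice coefficient reported in the abstract above is a standard overlap score between a predicted and a reference segmentation mask. The following is a minimal sketch of that metric only, not the PanSegNet evaluation code; the function name `dice_coefficient` and the flat 0/1 mask representation are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks.

    Both masks are flat sequences of 0/1 values of equal length
    (an illustrative simplification of a 3D voxel volume).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Convention chosen here: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 0, 0]
    print(round(dice_coefficient(predicted, reference), 4))
```

A score of 1.0 means the two masks coincide exactly and 0.0 means they share no foreground voxels; the abstract reports this value as a percentage averaged at case level.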
-
Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models
Authors:
Hye Sun Yun,
David Pogrebitskiy,
Iain J. Marshall,
Byron C. Wallace
Abstract:
Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized. Ideally, language technologies would permit fully automatic meta-analysis, on demand. This requires accurately extracting numerical results from individual trials, which has been beyond the capabilities of natural language processing (NLP) models to date. In this work, we evaluate whether modern large language models (LLMs) can reliably perform this task. We annotate (and release) a modest but granular evaluation dataset of clinical trial reports with numerical findings attached to interventions, comparators, and outcomes. Using this dataset, we evaluate the performance of seven LLMs applied zero-shot for the task of conditionally extracting numerical findings from trial reports. We find that massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality). However, LLMs -- including ones trained on biomedical texts -- perform poorly when the outcome measures are complex and tallying the results requires inference. This work charts a path toward fully automatic meta-analysis of RCTs via LLMs, while also highlighting the limitations of existing models for this aim.
Submitted 24 July, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
-
On-the-fly Definition Augmentation of LLMs for Biomedical NER
Authors:
Monica Munnangi,
Sergey Feldman,
Byron C Wallace,
Silvio Amir,
Tom Hope,
Aakanksha Naik
Abstract:
Despite their general capabilities, LLMs still struggle on biomedical NER tasks, which are difficult due to the presence of specialized terminology and lack of training data. In this work we set out to improve LLM performance on biomedical NER in limited data settings via a new knowledge augmentation approach which incorporates definitions of relevant concepts on-the-fly. During this process, to provide a test bed for knowledge augmentation, we perform a comprehensive exploration of prompting strategies. Our experiments show that definition augmentation is useful for both open source and closed LLMs. For example, it leads to a relative improvement of 15% (on average) in GPT-4 performance (F1) across all (six) of our test datasets. We conduct extensive ablations and analyses to demonstrate that our performance improvements stem from adding relevant definitional knowledge. We find that careful prompting strategies also improve LLM performance, allowing them to outperform fine-tuned language models in few-shot settings. To facilitate future research in this direction, we release our code at https://github.com/allenai/beacon.
Submitted 23 April, 2024; v1 submitted 29 March, 2024;
originally announced April 2024.
-
Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores
Authors:
Chantal Shaib,
Joe Barrow,
Jiuding Sun,
Alexa F. Siu,
Byron C. Wallace,
Ani Nenkova
Abstract:
The diversity across outputs generated by LLMs shapes perception of their quality and utility. High lexical diversity is often desirable, but there is no standard method to measure this property. Templated answer structures and "canned" responses across different documents are readily noticeable, but difficult to visualize across large corpora. This work aims to standardize measurement of text diversity. Specifically, we empirically investigate the convergent validity of existing scores across English texts, and we release diversity, an open-source Python package for measuring and extracting repetition in text. We also build a platform based on diversity for users to interactively explore repetition in text. We find that fast compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap homogeneity scores. Further, a combination of measures -- compression ratios, self-repetition of long $n$-grams, and Self-BLEU and BERTScore -- is sufficient to report, as they have low mutual correlation with each other.
Submitted 20 March, 2025; v1 submitted 1 March, 2024;
originally announced March 2024.
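The compression-based diversity measure mentioned in the abstract above can be illustrated with a short sketch. This is not the API of the released diversity package; the function name `compression_ratio`, the choice of zlib, and the definition of the ratio as raw bytes divided by compressed bytes are all assumptions made for illustration. Under that definition, a higher ratio indicates more repetition (lower diversity) across the texts.

```python
import zlib


def compression_ratio(texts):
    """Return raw size / compressed size for a collection of texts.

    The texts are joined into one byte string so that repetition
    across documents, not only within them, can be exploited by
    the compressor.
    """
    raw = "\n".join(texts).encode("utf-8")
    if not raw:
        raise ValueError("need at least one non-empty text")
    # Level 9 asks zlib for its strongest compression setting.
    compressed = zlib.compress(raw, 9)
    return len(raw) / len(compressed)


if __name__ == "__main__":
    repetitive = ["The patient was treated."] * 50
    varied = [f"sentence {i} mentions item {i * i * 7919 % 1000}" for i in range(50)]
    print("repetitive:", compression_ratio(repetitive))
    print("varied:", compression_ratio(varied))
```

The templated corpus compresses far better than the varied one, which is the intuition behind using a fast compressor as a stand-in for slower $n$-gram overlap scores.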
-
How Much Annotation is Needed to Compare Summarization Models?
Authors:
Chantal Shaib,
Joe Barrow,
Alexa F. Siu,
Byron C. Wallace,
Ani Nenkova
Abstract:
Modern instruction-tuned models have become highly capable in text generation tasks such as summarization, and are expected to be released at a steady pace. In practice one may now wish to choose confidently, but with minimal effort, the best performing summarization model when applied to a new domain or purpose. In this work, we empirically investigate the test sample size necessary to select a preferred model in the context of news summarization. Empirical results reveal that comparative evaluation converges quickly for both automatic and human evaluation, with clear preferences for a system emerging from under 100 examples. The human preference data allows us to quantify how well automatic scores can reproduce preference rankings across a variety of downstream summarization tasks. We find that, while automatic metrics are stable at smaller sample sizes, only some automatic metrics are able to moderately predict model win rates according to human preference.
Submitted 28 February, 2024;
originally announced February 2024.
-
Leveraging ChatGPT in Pharmacovigilance Event Extraction: An Empirical Study
Authors:
Zhaoyue Sun,
Gabriele Pergola,
Byron C. Wallace,
Yulan He
Abstract:
With the advent of large language models (LLMs), there has been growing interest in exploring their potential for medical applications. This research aims to investigate the ability of LLMs, specifically ChatGPT, in the context of pharmacovigilance event extraction, of which the main goal is to identify and extract adverse events or potential therapeutic events from textual medical sources. We conduct extensive experiments to assess the performance of ChatGPT in the pharmacovigilance event extraction task, employing various prompts and demonstration selection strategies. The findings show that while ChatGPT achieves reasonable performance with appropriate demonstration selection strategies, it still falls short compared to fully fine-tuned small models. Additionally, we explore the potential of leveraging ChatGPT for data augmentation. However, our investigation reveals that the inclusion of synthesized data into fine-tuning may lead to a decrease in performance, possibly attributed to noise in the ChatGPT-generated labels. To mitigate this, we explore different filtering strategies and find that, with the proper approach, more stable performance can be achieved, although constant improvement remains elusive.
Submitted 23 February, 2024;
originally announced February 2024.
-
GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence
Authors:
Kundan Krishna,
Sanjana Ramprasad,
Prakhar Gupta,
Byron C. Wallace,
Zachary C. Lipton,
Jeffrey P. Bigham
Abstract:
LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in 8 different LLM outputs when summarizing documents from diverse domains. User studies demonstrate that using GenAudit can substantially improve the performance of humans at finding errors in LLM-generated summaries. We release our tool (GenAudit) and fact-checking model for public use.
Submitted 19 January, 2025; v1 submitted 19 February, 2024;
originally announced February 2024.
-
FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence
Authors:
Sebastian Antony Joseph,
Lily Chen,
Jan Trienes,
Hannah Louisa Göke,
Monika Coers,
Wei Xu,
Byron C Wallace,
Junyi Jessy Li
Abstract:
Plain language summarization with LLMs can be useful for improving textual accessibility of technical content. But how factual are these summaries in a high-stakes domain like medicine? This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts describing randomized controlled trials (RCTs), which are the basis of evidence-based medicine and can directly inform patient treatment. FactPICO consists of 345 plain language summaries of RCT abstracts generated from three LLMs (i.e., GPT-4, Llama-2, and Alpaca), with fine-grained evaluation and natural language rationales from experts. We assess the factuality of critical elements of RCTs in those summaries: Populations, Interventions, Comparators, Outcomes (PICO), as well as the reported findings concerning these. We also evaluate the correctness of the extra information (e.g., explanations) added by LLMs. Using FactPICO, we benchmark a range of existing factuality metrics, including the newly devised ones based on LLMs. We find that plain language summarization of medical evidence is still challenging, especially when balancing between simplicity and factuality, and that existing metrics correlate poorly with expert judgments on the instance level.
Submitted 4 June, 2024; v1 submitted 17 February, 2024;
originally announced February 2024.
-
Towards Reducing Diagnostic Errors with Interpretable Risk Prediction
Authors:
Denis Jered McInerney,
William Dickinson,
Lucy C. Flynn,
Andrea C. Young,
Geoffrey S. Young,
Jan-Willem van de Meent,
Byron C. Wallace
Abstract:
Many diagnostic errors occur because clinicians cannot easily access relevant information in patient Electronic Health Records (EHRs). In this work we propose a method to use LLMs to identify pieces of evidence in patient EHR data that indicate increased or decreased risk of specific diagnoses; our ultimate aim is to increase access to evidence and reduce diagnostic errors. In particular, we propose a Neural Additive Model to make predictions backed by evidence with individualized risk estimates at time-points where clinicians are still uncertain, aiming to specifically mitigate delays in diagnosis and errors stemming from an incomplete differential. To train such a model, it is necessary to infer temporally fine-grained retrospective labels of eventual "true" diagnoses. We do so with LLMs, to ensure that the input text is from before a confident diagnosis can be made. We use an LLM to retrieve an initial pool of evidence, but then refine this set of evidence according to correlations learned by the model. We conduct an in-depth evaluation of the usefulness of our approach by simulating how it might be used by a clinician to decide between a pre-defined list of differential diagnoses.
Submitted 19 March, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
-
Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains
Authors:
Sanjana Ramprasad,
Kundan Krishna,
Zachary C Lipton,
Byron C Wallace
Abstract:
Recent work has shown that large language models (LLMs) are capable of generating summaries zero-shot (i.e., without explicit supervision) that, under human assessment, are often comparable or even preferred to manually composed reference summaries. However, this prior work has focussed almost exclusively on evaluating news article summarization. How do zero-shot summarizers perform in other (potentially more specialized) domains? In this work we evaluate zero-shot generated summaries across specialized domains including biomedical articles, and legal bills (in addition to standard news benchmarks for reference). We focus especially on the factuality of outputs. We acquire annotations from domain experts to identify inconsistencies in summaries and systematically categorize these errors. We analyze whether the prevalence of a given domain in the pretraining corpus affects extractiveness and faithfulness of generated summaries of articles in this domain. We release all collected annotations to facilitate additional research toward measuring and realizing factually accurate summarization, beyond news articles. The dataset can be downloaded from https://github.com/sanjanaramprasad/zero_shot_faceval_domains
Submitted 5 February, 2024;
originally announced February 2024.
-
Question answering systems for health professionals at the point of care -- a systematic review
Authors:
Gregory Kell,
Angus Roberts,
Serge Umansky,
Linglong Qian,
Davide Ferrari,
Frank Soboczenski,
Byron Wallace,
Nikhil Patel,
Iain J Marshall
Abstract:
Objective: Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence. However, QA systems have not been widely adopted. This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement.
Materials and methods: We searched PubMed, IEEE Xplore, ACM Digital Library, ACL Anthology and forward and backward citations on 7th February 2023. We included peer-reviewed journal and conference papers describing the design and evaluation of biomedical QA systems. Two reviewers screened titles, abstracts, and full-text articles. We conducted a narrative synthesis and risk of bias assessment for each study. We assessed the utility of biomedical QA systems.
Results: We included 79 studies and identified themes, including question realism, answer reliability, answer utility, clinical specialism, systems, usability, and evaluation methods. Clinicians' questions used to train and evaluate QA systems were restricted to certain sources, types and complexity levels. No system communicated confidence levels in the answers or sources. Many studies suffered from high risks of bias and applicability concerns. Only 8 studies completely satisfied any criterion for clinical utility, and only 7 reported user evaluations. Most systems were built with limited input from clinicians.
Discussion: While machine learning methods have led to increased accuracy, most studies imperfectly reflected real-world healthcare information needs. Key research priorities include developing more realistic healthcare QA datasets and considering the reliability of answer sources, rather than merely focusing on accuracy.
Submitted 24 January, 2024;
originally announced February 2024.
-
Extreme emission line galaxies detected in JADES JWST/NIRSpec I: inferred galaxy properties
Authors:
Kit Boyett,
Andrew J. Bunker,
Emma Curtis-Lake,
Jacopo Chevallard,
Alex J. Cameron,
Gareth C. Jones,
Aayush Saxena,
Stéphane Charlot,
Mirko Curti,
Imaan E. B. Wallace,
Santiago Arribas,
Stefano Carniani,
Chris Willott,
Stacey Alberts,
Daniel J. Eisenstein,
Kevin Hainline,
Ryan Hausen,
Benjamin D. Johnson,
Marcia Rieke,
Brant Robertson,
Daniel P. Stark,
Sandro Tacchella,
Christina C. Williams,
Zuyi Chen,
Eiichi Egami
, et al. (11 additional authors not shown)
Abstract:
Extreme emission line galaxies (EELGs) exhibit large equivalent widths (EW) in their rest-optical emission lines ([OIII]$\lambda5007$ or H$α$ rest-frame EW$ > 750Å$) which can be tied to a recent upturn in star formation rate, due to the sensitivity of the nebular line emission and the rest-optical continuum to young ($<10$Myr) and evolved stellar populations, respectively. By studying a sample of 85 star forming galaxies (SFGs), spanning the redshift and magnitude interval $3 <z<9.5$ and $-16>$ M$_{UV}>-21$, in the JWST Advanced Deep Extragalactic Survey (JADES) with NIRSpec/prism spectroscopy, we determine that SFGs initiate an EELG phase when entering a significant burst of star formation, with the highest EWs observed in EELGs with the youngest luminosity-weighted ages ($<5$ Myr old) and the highest burst intensity (those with the greatest excess between their current and long-term average SFR). We spectroscopically confirm that a greater proportion of SFGs are in an EELG phase at high redshift in our UV-selected sample ($61\pm4\%$ in our $z>5.7$ high-redshift bin, compared to $23^{+4}_{-1}\%$ in our lowest-redshift bin $3<z<4.1$) due to the combined evolution of metallicity, ionisation parameter and star formation histories with redshift. We report that the EELGs within our sample exhibit a higher average ionisation efficiency ($\log_{10}(ξ_{ion}^{HII}/$erg$^{-1}$Hz)$=25.5\pm0.2$) than the non-EELGs. High-redshift EELGs therefore comprise a population of efficient ionising photon producers. Additionally, we report that $53\%$ (9/17) of EELGs at $z>5.7$ have observed Lyman-$α$ emission, potentially lying within large ionised regions. The high detection rate of Lyman-$α$ emitters in our EELG selection suggests that the physical conditions associated with entering an EELG phase also promote the escape of Lyman-$α$ photons.
Submitted 23 October, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification
Authors:
Jan Trienes,
Sebastian Joseph,
Jörg Schlötterer,
Christin Seifert,
Kyle Lo,
Wei Xu,
Byron C. Wallace,
Junyi Jessy Li
Abstract:
Text simplification aims to make technical texts more accessible to laypeople but often results in deletion of information and vagueness. This work proposes InfoLossQA, a framework to characterize and recover simplification-induced information loss in the form of question-and-answer (QA) pairs. Building on the theory of Question Under Discussion, the QA pairs are designed to help readers deepen their knowledge of a text. We conduct a range of experiments with this framework. First, we collect a dataset of 1,000 linguist-curated QA pairs derived from 104 LLM simplifications of scientific abstracts of medical studies. Our analyses of this data reveal that information loss occurs frequently, and that the QA pairs give a high-level overview of what information was lost. Second, we devise two methods for this task: end-to-end prompting of open-source and commercial language models, and a natural language inference pipeline. With a novel evaluation framework considering the correctness of QA pairs and their linguistic suitability, our expert evaluation reveals that models struggle to reliably identify information loss and to apply the same standards as humans for what constitutes information loss.
Submitted 4 June, 2024; v1 submitted 29 January, 2024;
originally announced January 2024.
-
MedISure: Towards Assuring Machine Learning-based Medical Image Classifiers using Mixup Boundary Analysis
Authors:
Adam Byfield,
William Poulett,
Ben Wallace,
Anusha Jose,
Shatakshi Tyagi,
Smita Shembekar,
Adnan Qayyum,
Junaid Qadir,
Muhammad Bilal
Abstract:
Machine learning (ML) models are becoming integral in healthcare technologies, presenting a critical need for formal assurance to validate their safety, fairness, robustness, and trustworthiness. These models are inherently prone to errors, potentially posing serious risks to patient health and even causing irreparable harm. Traditional software assurance techniques rely on fixed code and do not directly apply to ML models since these algorithms are adaptable and learn from curated datasets through a training process. However, adapting established principles, such as boundary testing using synthetic test data, can effectively bridge this gap. To this end, we present a novel technique called Mix-Up Boundary Analysis (MUBA) that facilitates evaluating image classifiers in terms of prediction fairness. We evaluated MUBA for two important medical imaging tasks -- brain tumour classification and breast cancer classification -- and achieved promising results. This research aims to showcase the importance of adapting traditional assurance principles for assessing ML models to enhance the safety and reliability of healthcare technologies. To facilitate future research, we plan to publicly release our code for MUBA.
Submitted 23 November, 2023;
originally announced November 2023.
-
Diffusion Model Alignment Using Direct Preference Optimization
Authors:
Bram Wallace,
Meihua Dang,
Rafael Rafailov,
Linqi Zhou,
Aaron Lou,
Senthil Purushwalkam,
Stefano Ermon,
Caiming Xiong,
Shafiq Joty,
Nikhil Naik
Abstract:
Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. In contrast to LLMs, human preference learning has not been widely explored in text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model using carefully curated high quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO), a simpler alternative to RLHF which directly optimizes a policy that best satisfies human preferences under a classification objective. We re-formulate DPO to account for a diffusion model notion of likelihood, utilizing the evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. Our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement model in human evaluation, improving visual appeal and prompt alignment. We also develop a variant that uses AI feedback and has comparable performance to training on human preferences, opening the door for scaling of diffusion model alignment methods.
Submitted 21 November, 2023;
originally announced November 2023.
-
Leveraging Generative AI for Clinical Evidence Summarization Needs to Ensure Trustworthiness
Authors:
Gongbo Zhang,
Qiao Jin,
Denis Jered McInerney,
Yong Chen,
Fei Wang,
Curtis L. Cole,
Qian Yang,
Yanshan Wang,
Bradley A. Malin,
Mor Peleg,
Byron C. Wallace,
Zhiyong Lu,
Chunhua Weng,
Yifan Peng
Abstract:
Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, hold promise in facilitating the arduous task. However, developing accountable, fair, and inclusive models remains a complicated undertaking. In this perspective, we discuss the trustworthiness of generative AI in the context of automated summarization of medical evidence.
Submitted 31 March, 2024; v1 submitted 18 November, 2023;
originally announced November 2023.
-
Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
Authors:
Koyena Pal,
Jiuding Sun,
Andrew Yuan,
Byron C. Wallace,
David Bau
Abstract:
We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear approximation and causal intervention methods in GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs. We find that, at some layers, we can approximate a model's output with more than 48% accuracy with respect to its prediction of subsequent tokens through a single hidden state. Finally we present a "Future Lens" visualization that uses these methods to create a new view of transformer states.
Submitted 8 November, 2023;
originally announced November 2023.
-
Function Vectors in Large Language Models
Authors:
Eric Todd,
Millicent L. Li,
Arnab Sen Sharma,
Aaron Mueller,
Byron C. Wallace,
David Bau
Abstract:
We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs). Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number of attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV). FVs are robust to changes in context, i.e., they trigger execution of the task on inputs such as zero-shot and natural text settings that do not resemble the ICL contexts from which they are collected. We test FVs across a range of tasks, models, and layers and find strong causal effects across settings in middle layers. We investigate the internal structure of FVs and find that while they often contain information that encodes the output space of the function, this information alone is not sufficient to reconstruct an FV. Finally, we test semantic vector composition in FVs, and find that to some extent they can be summed to create vectors that trigger new complex tasks. Our findings show that compact, causal internal vector representations of function abstractions can be explicitly extracted from LLMs. Our code and data are available at https://functions.baulab.info.
Submitted 25 February, 2024; v1 submitted 23 October, 2023;
originally announced October 2023.
-
Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges
Authors:
Hiba Ahsan,
Denis Jered McInerney,
Jisoo Kim,
Christopher Potter,
Geoffrey Young,
Silvio Amir,
Byron C. Wallace
Abstract:
Unstructured data in Electronic Health Records (EHRs) often contains critical information -- complementary to imaging -- that could inform radiologists' diagnoses. But the large volume of notes often associated with patients together with time constraints renders manually identifying relevant evidence practically infeasible. In this work we propose and evaluate a zero-shot strategy for using LLMs as a mechanism to efficiently retrieve and summarize unstructured evidence in patient EHR relevant to a given query. Our method entails tasking an LLM to infer whether a patient has, or is at risk of, a particular condition on the basis of associated notes; if so, we ask the model to summarize the supporting evidence. Under expert evaluation, we find that this LLM-based approach provides outputs consistently preferred to a pre-LLM information retrieval baseline. Manual evaluation is expensive, so we also propose and validate a method using an LLM to evaluate (other) LLM outputs for this task, allowing us to scale up evaluation. Our findings indicate the promise of LLMs as interfaces to EHR, but also highlight the outstanding challenge posed by "hallucinations". In this setting, however, we show that model confidence in outputs strongly correlates with faithful summaries, offering a practical means to limit confabulations.
Submitted 10 June, 2024; v1 submitted 8 September, 2023;
originally announced September 2023.
-
Modulation-Enhanced Excitation for Continuous-Time Reinforcement Learning via Symmetric Kronecker Products
Authors:
Brent A. Wallace,
Jennie Si
Abstract:
This work introduces new results in continuous-time reinforcement learning (CT-RL) control of affine nonlinear systems to address a major algorithmic challenge due to a lack of persistence of excitation (PE). This PE design limitation has previously stifled CT-RL numerical performance and prevented these algorithms from achieving control synthesis goals. Our new theoretical developments in symmetric Kronecker products enable a proposed modulation-enhanced excitation (MEE) framework to make PE significantly more systematic and intuitive to achieve for real-world designers. MEE is applied to the suite of recently-developed excitable integral reinforcement learning (EIRL) algorithms, yielding a class of enhanced high-performance CT-RL control design methods which, due to the symmetric Kronecker product algebra, retain EIRL's convergence and closed-loop stability guarantees. Through numerical evaluation studies, we demonstrate how our new MEE framework achieves substantial improvements in conditioning when approximately solving the Hamilton-Jacobi-Bellman equation to obtain optimal controls. We use an intuitive example to provide insights on the central excitation issue under discussion, and we demonstrate the effectiveness of the proposed procedure on a real-world hypersonic vehicle (HSV) application.
Submitted 31 July, 2023;
originally announced July 2023.
-
Continuous-Time Reinforcement Learning: New Design Algorithms with Theoretical Insights and Performance Guarantees
Authors:
Brent A. Wallace,
Jennie Si
Abstract:
Continuous-time nonlinear optimal control problems hold great promise in real-world applications. After decades of development, reinforcement learning (RL) has achieved some of the greatest successes as a general nonlinear control design method. However, a recent comprehensive analysis of state-of-the-art continuous-time RL (CT-RL) methods, namely, adaptive dynamic programming (ADP)-based CT-RL algorithms, reveals they face significant design challenges due to their complexity, numerical conditioning, and dimensional scaling issues. Despite advanced theoretical results, existing ADP CT-RL synthesis methods are inadequate in solving even small, academic problems. The goal of this work is thus to introduce a suite of new CT-RL algorithms for control of affine nonlinear systems. Our design approach relies on two important factors. First, our methods are applicable to physical systems that can be partitioned into smaller subproblems. This constructive consideration results in reduced dimensionality and greatly improved intuitiveness of design. Second, we introduce a new excitation framework to improve persistence of excitation (PE) and numerical conditioning performance via classical input/output insights. Such a design-centric approach is the first of its kind in the ADP CT-RL community. In this paper, we progressively introduce a suite of (decentralized) excitable integral reinforcement learning (EIRL) algorithms. We provide convergence and closed-loop stability guarantees, and we demonstrate these guarantees on a significant application problem of controlling an unstable, nonminimum phase hypersonic vehicle (HSV).
Submitted 17 July, 2023;
originally announced July 2023.
-
Evaluating the Zero-shot Robustness of Instruction-tuned Language Models
Authors:
Jiuding Sun,
Chantal Shaib,
Byron C. Wallace
Abstract:
Instruction fine-tuning has recently emerged as a promising approach for improving the zero-shot capabilities of Large Language Models (LLMs) on new tasks. This technique has shown particular strength in improving the performance of modestly sized LLMs, sometimes inducing performance competitive with much larger model variants. In this paper we ask two questions: (1) How sensitive are instruction-tuned models to the particular phrasings of instructions, and (2) How can we make them more robust to such natural language variation? To answer the former, we collect a set of 319 instructions manually written by NLP practitioners for over 80 unique tasks included in widely used benchmarks, and we evaluate the variance and average performance of these instructions as compared to instruction phrasings observed during instruction fine-tuning. We find that using novel (unobserved) but appropriate instruction phrasings consistently degrades model performance, sometimes substantially so. Further, such natural instructions yield a wide variance in downstream performance, despite their semantic equivalence. Put another way, instruction-tuned models are not especially robust to instruction re-phrasings. We propose a simple method to mitigate this issue by introducing "soft prompt" embedding parameters and optimizing these to maximize the similarity between representations of semantically equivalent instructions. We show that this method consistently improves the robustness of instruction-tuned models.
Submitted 8 July, 2023; v1 submitted 19 June, 2023;
originally announced June 2023.
-
JADES NIRSpec Initial Data Release for the Hubble Ultra Deep Field: Redshifts and Line Fluxes of Distant Galaxies from the Deepest JWST Cycle 1 NIRSpec Multi-Object Spectroscopy
Authors:
Andrew J. Bunker,
Alex J. Cameron,
Emma Curtis-Lake,
Peter Jakobsen,
Stefano Carniani,
Mirko Curti,
Joris Witstok,
Roberto Maiolino,
Francesco D'Eugenio,
Tobias J. Looser,
Chris Willott,
Nina Bonaventura,
Kevin Hainline,
Hannah Uebler,
Christopher N. A. Willmer,
Aayush Saxena,
Renske Smit,
Stacey Alberts,
Santiago Arribas,
William M. Baker,
Stefi Baum,
Rachana Bhatawdekar,
Rebecca A. A. Bowler,
Kristan Boyett,
Stephane Charlot
, et al. (41 additional authors not shown)
Abstract:
We describe the NIRSpec component of the JWST Deep Extragalactic Survey (JADES), and provide deep spectroscopy of 253 sources targeted with the NIRSpec micro-shutter assembly in the Hubble Ultra Deep Field and surrounding GOODS-South. The multi-object spectra presented here are the deepest so far obtained with JWST, amounting to up to 28 hours in the low-dispersion ($R\sim 30-300$) prism, and up to 7 hours in each of the three medium-resolution $R\approx 1000$ gratings and one high-dispersion grating, G395H ($R\approx2700$). Our low-dispersion and medium-dispersion spectra cover the wavelength range $0.6-5.3\,\mu$m. We describe the selection of the spectroscopic targets, the strategy for the allocation of targets to micro-shutters, and the design of the observations. We present the public release of the reduced 2D and 1D spectra, and a description of the reduction and calibration process. We measure spectroscopic redshifts for 178 of the objects targeted, extending up to $z=13.2$. We present a catalog of all emission lines detected at $S/N>5$, and our redshift determinations for the targets. Combined with the first JADES NIRCam data release, these public JADES spectroscopic and imaging datasets provide a new foundation for discoveries of the infrared universe by the worldwide scientific community.
Submitted 31 May, 2024; v1 submitted 4 June, 2023;
originally announced June 2023.
-
JADES Initial Data Release for the Hubble Ultra Deep Field: Revealing the Faint Infrared Sky with Deep JWST NIRCam Imaging
Authors:
Marcia J. Rieke,
Brant E. Robertson,
Sandro Tacchella,
Kevin Hainline,
Benjamin D. Johnson,
Ryan Hausen,
Zhiyuan Ji,
Christopher N. A. Willmer,
Daniel J. Eisenstein,
Dàvid Puskàs,
Stacey Alberts,
Santiago Arribas,
William M. Baker,
Stefi Baum,
Rachana Bhatawdekar,
Nina Bonaventura,
Kit Boyett,
Andrew Bunker,
Alex J. Cameron,
Stefano Carniani,
Stephane Charlot,
Jacopo Chevallard,
Zuyi Chen,
Mirko Curti,
Emma Curtis-Lake
, et al. (34 additional authors not shown)
Abstract:
JWST has revolutionized the field of extragalactic astronomy with its sensitive and high-resolution infrared view of the distant universe. Adding to the new legacy of JWST observations, we present the first NIRCam imaging data release from the JWST Advanced Deep Extragalactic Survey (JADES) providing 9 filters of infrared imaging of $\sim$25 arcmin$^2$ covering the Hubble Ultra Deep Field and portions of Great Observatories Origins Deep Survey (GOODS) South. Utilizing 87 on-sky dual-filter hours of exposure time, these images reveal the deepest ever near-infrared view of this iconic field. We supply carefully constructed 9-band mosaics of the JADES bands, as well as matching reductions of 5 additional bands from the JWST Extragalactic Medium-band Survey (JEMS). Combining with existing HST imaging, we provide 23-band space-based photometric catalogs and photometric redshifts for $\approx47,500$ sources. To promote broad engagement with the JADES survey, we have created an interactive FitsMap website to provide an interface for professional researchers and the public to experience these JWST datasets. Combined with the first JADES NIRSpec data release, these public JADES imaging and spectroscopic datasets provide a new foundation for discoveries of the infrared universe by the worldwide scientific community.
Submitted 1 September, 2023; v1 submitted 4 June, 2023;
originally announced June 2023.
-
Overview of the JWST Advanced Deep Extragalactic Survey (JADES)
Authors:
Daniel J. Eisenstein,
Chris Willott,
Stacey Alberts,
Santiago Arribas,
Nina Bonaventura,
Andrew J. Bunker,
Alex J. Cameron,
Stefano Carniani,
Stephane Charlot,
Emma Curtis-Lake,
Francesco D'Eugenio,
Ryan Endsley,
Pierre Ferruit,
Giovanna Giardino,
Kevin Hainline,
Ryan Hausen,
Peter Jakobsen,
Benjamin D. Johnson,
Roberto Maiolino,
Marcia Rieke,
George Rieke,
Hans-Walter Rix,
Brant Robertson,
Daniel P. Stark,
Sandro Tacchella
, et al. (51 additional authors not shown)
Abstract:
We present an overview of the James Webb Space Telescope (JWST) Advanced Deep Extragalactic Survey (JADES), an ambitious program of infrared imaging and spectroscopy in the GOODS-S and GOODS-N deep fields, designed to study galaxy evolution from high redshift to cosmic noon. JADES uses about 770 hours of Cycle 1 guaranteed time largely from the Near-Infrared Camera (NIRCam) and Near-Infrared Spectrograph (NIRSpec) instrument teams. In GOODS-S, in and around the Hubble Ultra Deep Field and Chandra Deep Field South, JADES produces a deep imaging region of ~45 arcmin$^2$ with an average of 130 hrs of exposure time spread over 9 NIRCam filters. This is extended at medium depth in GOODS-S and GOODS-N with NIRCam imaging of ~175 arcmin$^2$ with an average exposure time of 20 hrs spread over 8-10 filters. In both fields, we conduct extensive NIRSpec multi-object spectroscopy, including 2 deep pointings of 55 hrs exposure time, 14 medium pointings of ~12 hrs, and 15 shallower pointings of ~4 hrs, targeting over 5000 HST and JWST-detected faint sources with 5 low, medium, and high-resolution dispersers covering 0.6-5.3 microns. Finally, JADES extends redward via coordinated parallels with the JWST Mid-Infrared Instrument (MIRI), featuring ~9 arcmin$^2$ with 43 hours of exposure at 7.7 microns and twice that area with 2-6.5 hours of exposure at 12.8 microns. For nearly 30 years, the GOODS-S and GOODS-N fields have been developed as the premier deep fields on the sky; JADES is now providing a compelling start on the JWST legacy in these fields.
Submitted 4 June, 2023;
originally announced June 2023.
-
USB: A Unified Summarization Benchmark Across Tasks and Domains
Authors:
Kundan Krishna,
Prakhar Gupta,
Sanjana Ramprasad,
Byron C. Wallace,
Jeffrey P. Bigham,
Zachary C. Lipton
Abstract:
While the NLP community has produced numerous summarization benchmarks, none provide the rich annotations required to simultaneously address many important problems related to control and reliability. We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports 8 interrelated tasks: (i) extractive summarization; (ii) abstractive summarization; (iii) topic-based summarization; (iv) compressing selected sentences into a one-line summary; (v) surfacing evidence for a summary sentence; (vi) predicting the factual accuracy of a summary sentence; (vii) identifying unsubstantiated spans in a summary sentence; (viii) correcting factual errors in summaries. We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models. For factuality-related tasks, we also evaluate existing heuristics to create training data and find that training on them results in worse performance than training on $20\times$ less human-labeled data. Our articles draw from 6 domains, facilitating cross-domain analysis. On some tasks, the amount of training data matters more than the domain where it comes from, while for other tasks training specifically on data from the target domain, even if limited, is more beneficial.
Submitted 4 December, 2023; v1 submitted 23 May, 2023;
originally announced May 2023.
-
Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations
Authors:
Lucy Lu Wang,
Yulia Otmakhova,
Jay DeYoung,
Thinh Hung Truong,
Bailey E. Kuehl,
Erin Bransom,
Byron C. Wallace
Abstract:
Evaluating multi-document summarization (MDS) quality is difficult. This is especially true in the case of MDS for biomedical literature reviews, where models must synthesize contradicting evidence reported across different documents. Prior work has shown that rather than performing the task, models may exploit shortcuts that are difficult to detect using standard n-gram similarity metrics such as ROUGE. Better automated evaluation metrics are needed, but few resources exist to assess metrics when they are proposed. Therefore, we introduce a dataset of human-assessed summary quality facets and pairwise preferences to encourage and support the development of better automated evaluation methods for literature review MDS. We take advantage of community submissions to the Multi-document Summarization for Literature Review (MSLR) shared task to compile a diverse and representative sample of generated summaries. We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries, with other automated metrics including several we propose in this work, and with aspects of human-assessed summary quality. We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, in many cases the system rankings produced by these metrics are anti-correlated with rankings according to human annotators.
Submitted 23 May, 2023;
originally announced May 2023.
-
Multilingual Simplification of Medical Texts
Authors:
Sebastian Joseph,
Kathryn Kazanas,
Keziah Reina,
Vishnesh J. Ramanathan,
Wei Xu,
Byron C. Wallace,
Junyi Jessy Li
Abstract:
Automated text simplification aims to produce simple versions of complex texts. This task is especially useful in the medical domain, where the latest medical findings are typically communicated via complex and technical articles. This creates barriers for laypeople seeking access to up-to-date medical findings, consequently impeding progress on health literacy. Most existing work on medical text simplification has focused on monolingual settings, with the result that such evidence would be available in only one language (most often, English). This work addresses this limitation via multilingual simplification, i.e., directly simplifying complex texts into simplified texts in multiple languages. We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages: English, Spanish, French, and Farsi. We evaluate fine-tuned and zero-shot models across these languages, with extensive human assessments and analyses. Although models can now generate viable simplified texts, we identify outstanding challenges that this dataset might be used to address.
Submitted 18 October, 2023; v1 submitted 21 May, 2023;
originally announced May 2023.
-
Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews
Authors:
Hye Sun Yun,
Iain J. Marshall,
Thomas A. Trikalinos,
Byron C. Wallace
Abstract:
Medical systematic reviews play a vital role in healthcare decision making and policy. However, their production is time-consuming, limiting the availability of high-quality and up-to-date evidence summaries. Recent advancements in large language models (LLMs) offer the potential to automatically generate literature reviews on demand, addressing this issue. However, LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucination or omission. In healthcare, this can make LLMs unusable at best and dangerous at worst. We conducted 16 interviews with international systematic review experts to characterize the perceived utility and risks of LLMs in the specific context of medical evidence reviews. Experts indicated that LLMs can assist in the writing process by drafting summaries, generating templates, distilling information, and crosschecking information. They also raised concerns regarding confidently composed but inaccurate LLM outputs and other potential downstream harms, including decreased accountability and proliferation of low-quality reviews. Informed by this qualitative analysis, we identify criteria for rigorous evaluation of biomedical LLMs aligned with domain expert views.
Submitted 18 October, 2023; v1 submitted 19 May, 2023;
originally announced May 2023.
-
Summarizing, Simplifying, and Synthesizing Medical Evidence Using GPT-3 (with Varying Success)
Authors:
Chantal Shaib,
Millicent L. Li,
Sebastian Joseph,
Iain J. Marshall,
Junyi Jessy Li,
Byron C. Wallace
Abstract:
Large language models, particularly GPT-3, are able to produce high quality summaries of general domain news articles in few- and zero-shot settings. However, it is unclear if such models are similarly capable in more specialized, high-stakes domains such as biomedicine. In this paper, we enlist domain experts (individuals with medical training) to evaluate summaries of biomedical articles generated by GPT-3, given zero supervision. We consider both single- and multi-document settings. In the former, GPT-3 is tasked with generating regular and plain-language summaries of articles describing randomized controlled trials; in the latter, we assess the degree to which GPT-3 is able to synthesize evidence reported across a collection of articles. We design an annotation scheme for evaluating model outputs, with an emphasis on assessing the factual accuracy of generated summaries. We find that while GPT-3 is able to summarize and simplify single biomedical articles faithfully, it struggles to provide accurate aggregations of findings over multiple documents. We release all data and annotations used in this work.
Submitted 11 May, 2023; v1 submitted 10 May, 2023;
originally announced May 2023.
-
Revisiting Relation Extraction in the era of Large Language Models
Authors:
Somin Wadhwa,
Silvio Amir,
Byron C. Wallace
Abstract:
Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text. Standard supervised RE techniques entail training modules to tag tokens comprising entity spans and then predict the relationship between them. Recent work has instead treated the problem as a sequence-to-sequence task, linearizing relations between entities as target strings to be generated conditioned on the input. Here we push the limits of this approach, using larger language models (GPT-3 and Flan-T5 large) than considered in prior work and evaluating their performance on standard RE tasks under varying levels of supervision. We address issues inherent to evaluating generative approaches to RE by doing human evaluations, in lieu of relying on exact matching. Under this refined evaluation, we find that: (1) Few-shot prompting with GPT-3 achieves near SOTA performance, i.e., roughly equivalent to existing fully supervised models; (2) Flan-T5 is not as capable in the few-shot setting, but supervising and fine-tuning it with Chain-of-Thought (CoT) style explanations (generated via GPT-3) yields SOTA results. We release this model as a new baseline for RE tasks.
Submitted 16 July, 2024; v1 submitted 8 May, 2023;
originally announced May 2023.
-
Jointly Extracting Interventions, Outcomes, and Findings from RCT Reports with LLMs
Authors:
Somin Wadhwa,
Jay DeYoung,
Benjamin Nye,
Silvio Amir,
Byron C. Wallace
Abstract:
Results from Randomized Controlled Trials (RCTs) establish the comparative effectiveness of interventions, and are in turn critical inputs for evidence-based care. However, results from RCTs are presented in (often unstructured) natural language articles describing the design, execution, and outcomes of trials; clinicians must manually extract findings pertaining to interventions and outcomes of interest from such articles. This onerous manual process has motivated work on (semi-)automating extraction of structured evidence from trial reports. In this work we propose and evaluate a text-to-text model built on instruction-tuned Large Language Models (LLMs) to jointly extract Interventions, Outcomes, and Comparators (ICO elements) from clinical abstracts, and infer the associated results reported. Manual (expert) and automated evaluations indicate that framing evidence extraction as a conditional generation task and fine-tuning LLMs for this purpose realizes considerable (~20 point absolute F1 score) gains over the previous SOTA. We perform ablations and error analyses to assess aspects that contribute to model performance, and to highlight potential directions for further improvements. We apply our model to a collection of published RCTs through mid-2022, and release a searchable database of structured findings: http://ico-relations.ebm-nlp.com
Submitted 17 July, 2023; v1 submitted 5 May, 2023;
originally announced May 2023.
-
End-to-End Diffusion Latent Optimization Improves Classifier Guidance
Authors:
Bram Wallace,
Akash Gokul,
Stefano Ermon,
Nikhil Naik
Abstract:
Classifier guidance -- using the gradients of an image classifier to steer the generations of a diffusion model -- has the potential to dramatically expand the creative control over image generation and editing. However, currently classifier guidance requires either training new noise-aware models to obtain accurate gradients or using a one-step denoising approximation of the final generation, which leads to misaligned gradients and sub-optimal control. We highlight this approximation's shortcomings and propose a novel guidance method: Direct Optimization of Diffusion Latents (DOODL), which enables plug-and-play guidance by optimizing diffusion latents w.r.t. the gradients of a pre-trained classifier on the true generated pixels, using an invertible diffusion process to achieve memory-efficient backpropagation. Showcasing the potential of more precise guidance, DOODL outperforms one-step classifier guidance on computational and human evaluation metrics across different forms of guidance: using CLIP guidance to improve generations of complex prompts from DrawBench, using fine-grained visual classifiers to expand the vocabulary of Stable Diffusion, enabling image-conditioned generation with a CLIP visual encoder, and improving image aesthetics using an aesthetic scoring network. Code at https://github.com/salesforce/DOODL.
Submitted 31 May, 2023; v1 submitted 23 March, 2023;
originally announced March 2023.