
Showing 1–50 of 144 results for author: Wallace, B

  1. arXiv:2506.20876  [pdf, ps, other]

    cs.CL

    Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine

    Authors: Sebastian Joseph, Lily Chen, Barry Wei, Michael Mackert, Iain J. Marshall, Paul Pu Liang, Ramez Kouzy, Byron C. Wallace, Junyi Jessy Li

    Abstract: Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet…

    Submitted 25 June, 2025; originally announced June 2025.

  2. arXiv:2504.03022  [pdf, other]

    cs.CL cs.AI

    The Dual-Route Model of Induction

    Authors: Sheridan Feucht, Eric Todd, Byron Wallace, David Bau

    Abstract: Prior work on in-context copying has shown the existence of induction heads, which attend to and promote individual tokens during copying. In this work we introduce a new type of induction head: concept-level induction heads, which copy entire lexical units instead of individual tokens. Concept induction heads learn to attend to the ends of multi-token words throughout training, working in paralle…

    Submitted 3 April, 2025; originally announced April 2025.

    Comments: 36 pages, 39 figures. Code and data at https://dualroute.baulab.info

    ACM Class: I.2.7

  3. arXiv:2502.13319  [pdf, other]

    cs.CL

    Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare

    Authors: Hiba Ahsan, Arnab Sen Sharma, Silvio Amir, David Bau, Byron C. Wallace

    Abstract: We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gende…

    Submitted 18 February, 2025; originally announced February 2025.

  4. arXiv:2502.07963  [pdf, other]

    cs.CL cs.AI

    Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?

    Authors: Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace

    Abstract: Medical research faces well-documented challenges in translating novel treatments into clinical practice. Publishing incentives encourage researchers to present "positive" findings, even when empirical results are equivocal. Consequently, it is well-documented that authors often spin study results, especially in article abstracts. Such spin can influence clinician interpretation of evidence and ma…

    Submitted 5 May, 2025; v1 submitted 11 February, 2025; originally announced February 2025.

    Comments: 22 pages, 12 figures, 4 tables, CHIL 2025

  5. arXiv:2502.06659  [pdf, other]

    cs.CL

    Who Taught You That? Tracing Teachers in Model Distillation

    Authors: Somin Wadhwa, Chantal Shaib, Silvio Amir, Byron C. Wallace

    Abstract: Model distillation -- using outputs from a large teacher model to teach a small student model -- is a practical means of creating efficient models for a particular task. We ask: Can we identify a student's teacher based on its outputs? Such "footprints" left by teacher LLMs would be interesting artifacts. Beyond this, reliable teacher inference may have practical implications as actors seek to dis…

    Submitted 20 May, 2025; v1 submitted 10 February, 2025; originally announced February 2025.

    Comments: Findings of ACL 2025

  6. arXiv:2411.16638  [pdf, other]

    cs.CL cs.AI

    Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation

    Authors: Sanjana Ramprasad, Byron C. Wallace

    Abstract: Modern LLMs can now produce highly readable abstractive summaries, to the point where traditional automated metrics for evaluating summary quality, such as ROUGE, have become saturated. However, LLMs still sometimes introduce unwanted content into summaries, i.e., information inconsistent with or unsupported by their source. Measuring the occurrence of these often subtle "hallucinations" automat…

    Submitted 28 November, 2024; v1 submitted 25 November, 2024; originally announced November 2024.

  7. arXiv:2411.05697  [pdf, other]

    eess.IV cs.DC cs.LG

    IPMN Risk Assessment under Federated Learning Paradigm

    Authors: Hongyi Pan, Ziliang Hong, Gorkem Durak, Elif Keles, Halil Ertugrul Aktas, Yavuz Taktak, Alpay Medetalibeyoglu, Zheyuan Zhang, Yury Velichko, Concetto Spampinato, Ivo Schoots, Marco J. Bruno, Pallavi Tiwari, Candice Bolan, Tamas Gonda, Frank Miller, Rajesh N. Keswani, Michael B. Wallace, Ziyue Xu, Ulas Bagci

    Abstract: Accurate classification of Intraductal Papillary Mucinous Neoplasms (IPMN) is essential for identifying high-risk cases that require timely intervention. In this study, we develop a federated learning framework for multi-center IPMN classification utilizing a comprehensive pancreas MRI dataset. This dataset includes 652 T1-weighted and 655 T2-weighted MRI images, accompanied by corresponding IPMN…

    Submitted 22 January, 2025; v1 submitted 8 November, 2024; originally announced November 2024.

    Comments: This paper has been accepted to ISBI 2025

  8. arXiv:2410.22530  [pdf, other]

    eess.IV cs.CV cs.DC

    Adaptive Aggregation Weights for Federated Segmentation of Pancreas MRI

    Authors: Hongyi Pan, Gorkem Durak, Zheyuan Zhang, Yavuz Taktak, Elif Keles, Halil Ertugrul Aktas, Alpay Medetalibeyoglu, Yury Velichko, Concetto Spampinato, Ivo Schoots, Marco J. Bruno, Rajesh N. Keswani, Pallavi Tiwari, Candice Bolan, Tamas Gonda, Michael G. Goggins, Michael B. Wallace, Ziyue Xu, Ulas Bagci

    Abstract: Federated learning (FL) enables collaborative model training across institutions without sharing sensitive data, making it an attractive solution for medical imaging tasks. However, traditional FL methods, such as Federated Averaging (FedAvg), face difficulties in generalizing across domains due to variations in imaging protocols and patient demographics across institutions. This challenge is part…

    Submitted 6 May, 2025; v1 submitted 29 October, 2024; originally announced October 2024.

    Comments: This paper has been accepted to ISBI 2025

  9. arXiv:2410.18193  [pdf, ps, other]

    astro-ph.GA

    Characterising the z $\sim$ 7.66 Type-II AGN candidate SMACS S06355 using BEAGLE-AGN and JWST NIRSpec/NIRCam

    Authors: M. S. Silcock, E. Curtis-Lake, D. J. B. Smith, I. E. B. Wallace, A. Vidal-García, A. Plat, M. Hirschmann, A. Feltre, J. Chevallard, S. Charlot, S. Carniani, A. J. Bunker

    Abstract: The presence of Active Galactic Nuclei (AGN) in low mass (Mstar $\lesssim$ $10^{9}$ Msun) galaxies at high redshift has been established, and it is important to characterise these objects and the impact of their feedback on the host galaxies. In this paper we apply the Spectral Energy Distribution (SED) fitting code BEAGLE-AGN to SMACS S06355, a z $\sim$ 7.66 Type-II AGN candidate from the JWST NI…

    Submitted 27 June, 2025; v1 submitted 23 October, 2024; originally announced October 2024.

    Comments: 16 pages, 9 figures

  10. arXiv:2410.08507  [pdf, ps, other]

    cs.RO

    Decentralized Uncertainty-Aware Active Search with a Team of Aerial Robots

    Authors: Wennie Tabib, John Stecklein, Caleb McDowell, Kshitij Goel, Felix Jonathan, Abhishek Rathod, Meghan Kokoski, Edsel Burkholder, Brian Wallace, Luis Ernesto Navarro-Serment, Nikhil Angad Bakshi, Tejus Gupta, Norman Papernick, David Guttendorf, Erik E. Kahn, Jessica Kasemer, Jesse Holdaway, Jeff Schneider

    Abstract: Rapid search and rescue is critical to maximizing survival rates following natural disasters. However, these efforts are challenged by the need to search large disaster zones, lack of reliability in the communications infrastructure, and a priori unknown numbers of objects of interest (OOIs), such as injured survivors. Aerial robots are increasingly being deployed for search and rescue due to thei…

    Submitted 10 June, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

    Comments: accepted at ISER 2025

  11. arXiv:2407.19284  [pdf, other]

    eess.IV cs.CV

    Optimizing Synthetic Data for Enhanced Pancreatic Tumor Segmentation

    Authors: Linkai Peng, Zheyuan Zhang, Gorkem Durak, Frank H. Miller, Alpay Medetalibeyoglu, Michael B. Wallace, Ulas Bagci

    Abstract: Pancreatic cancer remains one of the leading causes of cancer-related mortality worldwide. Precise segmentation of pancreatic tumors from medical images is a bottleneck for effective clinical decision-making. However, achieving a high accuracy is often limited by the small size and availability of real patient data for training deep learning models. Recent approaches have employed synthetic data g…

    Submitted 1 October, 2024; v1 submitted 27 July, 2024; originally announced July 2024.

    Comments: MICCAI Workshop AIPAD 2024

  12. arXiv:2407.14561  [pdf, other]

    cs.LG cs.AI

    NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals

    Authors: Jaden Fiotto-Kaufman, Alexander R. Loftus, Eric Todd, Jannik Brinkmann, Koyena Pal, Dmitrii Troitskii, Michael Ripa, Adam Belfki, Can Rager, Caden Juang, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Nikhil Prakash, Carla Brodley, Arjun Guha, Jonathan Bell, Byron C. Wallace, David Bau

    Abstract: We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of the representations and computations learned by very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred remote execution. The National Deep Inference Fabric (NDIF) is a scalable inference service that executes NNsight requests, allowing users to share GPU re…

    Submitted 1 April, 2025; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: Code at https://nnsight.net

  13. arXiv:2407.09429  [pdf, other]

    cs.CL

    Open (Clinical) LLMs are Sensitive to Instruction Phrasings

    Authors: Alberto Mario Ceballos Arroyo, Monica Munnangi, Jiuding Sun, Karen Y. C. Zhang, Denis Jered McInerney, Byron C. Wallace, Silvio Amir

    Abstract: Instruction-tuned Large Language Models (LLMs) can perform a wide range of tasks given natural language instructions to do so, but they are sensitive to how such instructions are phrased. This issue is especially concerning in healthcare, as clinicians are unlikely to be experienced prompt engineers and the potential consequences of inaccurate outputs are heightened in this domain. This raises a…

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: To appear at BioNLP, ACL 2024

  14. arXiv:2407.00211  [pdf, other]

    cs.CL

    Detection and Measurement of Syntactic Templates in Generated Text

    Authors: Chantal Shaib, Yanai Elazar, Junyi Jessy Li, Byron C. Wallace

    Abstract: Recent work on evaluating the diversity of text generated by LLMs has focused on word-level features. Here we offer an analysis of syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts. W…

    Submitted 6 October, 2024; v1 submitted 28 June, 2024; originally announced July 2024.

    Comments: EMNLP 2024

  15. arXiv:2406.20086  [pdf, other]

    cs.CL cs.LG

    Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs

    Authors: Sheridan Feucht, David Atkinson, Byron Wallace, David Bau

    Abstract: LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantical…

    Submitted 11 October, 2024; v1 submitted 28 June, 2024; originally announced June 2024.

    Comments: 13 pages, 14 figures. Code and data at https://footprints.baulab.info/

    ACM Class: I.2.7

  16. arXiv:2406.14511  [pdf, other]

    cs.CL

    Investigating Mysteries of CoT-Augmented Distillation

    Authors: Somin Wadhwa, Silvio Amir, Byron C. Wallace

    Abstract: Eliciting "chain of thought" (CoT) rationales -- sequences of tokens that convey a "reasoning" process -- has been shown to consistently improve LLM performance on tasks like question answering. More recent efforts have shown that such rationales can also be used for model distillation: Including CoT sequences (elicited from a large "teacher" model) in addition to target labels when fine-tuning a s…

    Submitted 27 September, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: Accepted to EMNLP 2024

  17. arXiv:2406.09330  [pdf, other]

    cs.CL

    Learning from Natural Language Explanations for Generalizable Entity Matching

    Authors: Somin Wadhwa, Adit Krishnan, Runhui Wang, Byron C. Wallace, Chris Kong

    Abstract: Entity matching is the task of linking records from different sources that refer to the same real-world entity. Past work has primarily treated entity linking as a standard supervised learning problem. However, supervised entity matching models often do not generalize well to new data, and collecting exhaustive labeled training data is often cost prohibitive. Further, recent efforts have adopted L…

    Submitted 27 September, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted to EMNLP 2024

  18. arXiv:2405.12367  [pdf, other]

    eess.IV cs.CV

    Large-Scale Multi-Center CT and MRI Segmentation of Pancreas with Deep Learning

    Authors: Zheyuan Zhang, Elif Keles, Gorkem Durak, Yavuz Taktak, Onkar Susladkar, Vandan Gorade, Debesh Jha, Asli C. Ormeci, Alpay Medetalibeyoglu, Lanhong Yao, Bin Wang, Ilkin Sevgi Isler, Linkai Peng, Hongyi Pan, Camila Lopes Vendrami, Amir Bourhani, Yury Velichko, Boqing Gong, Concetto Spampinato, Ayis Pyrros, Pallavi Tiwari, Derk C. F. Klatte, Megan Engels, Sanne Hoogenboom, Candice W. Bolan, et al. (13 additional authors not shown)

    Abstract: Automated volumetric segmentation of the pancreas on cross-sectional imaging is needed for diagnosis and follow-up of pancreatic diseases. While CT-based pancreatic segmentation is more established, MRI-based segmentation methods are understudied, largely due to a lack of publicly available datasets, benchmarking research efforts, and domain-specific deep learning methods. In this retrospective st…

    Submitted 24 October, 2024; v1 submitted 20 May, 2024; originally announced May 2024.

    Comments: Peer-reviewer version

  19. arXiv:2405.01686  [pdf, other]

    cs.CL cs.AI

    Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models

    Authors: Hye Sun Yun, David Pogrebitskiy, Iain J. Marshall, Byron C. Wallace

    Abstract: Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individu…

    Submitted 24 July, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

    Comments: 25 pages, 7 figures, 6 tables, MLHC 2024

  20. arXiv:2404.00152  [pdf, other]

    cs.CL

    On-the-fly Definition Augmentation of LLMs for Biomedical NER

    Authors: Monica Munnangi, Sergey Feldman, Byron C Wallace, Silvio Amir, Tom Hope, Aakanksha Naik

    Abstract: Despite their general capabilities, LLMs still struggle on biomedical NER tasks, which are difficult due to the presence of specialized terminology and lack of training data. In this work we set out to improve LLM performance on biomedical NER in limited data settings via a new knowledge augmentation approach which incorporates definitions of relevant concepts on-the-fly. During this process, to p…

    Submitted 23 April, 2024; v1 submitted 29 March, 2024; originally announced April 2024.

    Comments: To appear at NAACL 2024 (Main)

  21. arXiv:2403.00553  [pdf, other]

    cs.CL

    Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores

    Authors: Chantal Shaib, Joe Barrow, Jiuding Sun, Alexa F. Siu, Byron C. Wallace, Ani Nenkova

    Abstract: The diversity across outputs generated by LLMs shapes perception of their quality and utility. High lexical diversity is often desirable, but there is no standard method to measure this property. Templated answer structures and "canned" responses across different documents are readily noticeable, but difficult to visualize across large corpora. This work aims to standardize measurement of text d…

    Submitted 20 March, 2025; v1 submitted 1 March, 2024; originally announced March 2024.

    Comments: Preprint

  22. arXiv:2402.18756  [pdf, other]

    cs.CL

    How Much Annotation is Needed to Compare Summarization Models?

    Authors: Chantal Shaib, Joe Barrow, Alexa F. Siu, Byron C. Wallace, Ani Nenkova

    Abstract: Modern instruction-tuned models have become highly capable in text generation tasks such as summarization, and are expected to be released at a steady pace. In practice one may now wish to choose confidently, but with minimal effort, the best performing summarization model when applied to a new domain or purpose. In this work, we empirically investigate the test sample size necessary to select a p…

    Submitted 28 February, 2024; originally announced February 2024.

    Comments: Preprint

  23. arXiv:2402.15663  [pdf, other]

    cs.CL

    Leveraging ChatGPT in Pharmacovigilance Event Extraction: An Empirical Study

    Authors: Zhaoyue Sun, Gabriele Pergola, Byron C. Wallace, Yulan He

    Abstract: With the advent of large language models (LLMs), there has been growing interest in exploring their potential for medical applications. This research aims to investigate the ability of LLMs, specifically ChatGPT, in the context of pharmacovigilance event extraction, of which the main goal is to identify and extract adverse events or potential therapeutic events from textual medical sources. We con…

    Submitted 23 February, 2024; originally announced February 2024.

    Comments: 14 pages, 2 figures, accepted by EACL 2024

  24. arXiv:2402.12566  [pdf, other]

    cs.CL cs.LG

    GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

    Authors: Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C. Wallace, Zachary C. Lipton, Jeffrey P. Bigham

    Abstract: LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that ar…

    Submitted 19 January, 2025; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: Code and models available at https://genaudit.org

  25. arXiv:2402.11456  [pdf, other]

    cs.CL

    FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence

    Authors: Sebastian Antony Joseph, Lily Chen, Jan Trienes, Hannah Louisa Göke, Monika Coers, Wei Xu, Byron C Wallace, Junyi Jessy Li

    Abstract: Plain language summarization with LLMs can be useful for improving textual accessibility of technical content. But how factual are these summaries in a high-stakes domain like medicine? This paper presents FactPICO, a factuality benchmark for plain language summarization of medical texts describing randomized controlled trials (RCTs), which are the basis of evidence-based medicine and can directly…

    Submitted 4 June, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

    Comments: Preprint has been updated to match the final revision for ACL 2024

  26. arXiv:2402.10109  [pdf, other]

    cs.AI cs.CL cs.LG

    Towards Reducing Diagnostic Errors with Interpretable Risk Prediction

    Authors: Denis Jered McInerney, William Dickinson, Lucy C. Flynn, Andrea C. Young, Geoffrey S. Young, Jan-Willem van de Meent, Byron C. Wallace

    Abstract: Many diagnostic errors occur because clinicians cannot easily access relevant information in patient Electronic Health Records (EHRs). In this work we propose a method to use LLMs to identify pieces of evidence in patient EHR data that indicate increased or decreased risk of specific diagnoses; our ultimate aim is to increase access to evidence and reduce diagnostic errors. In particular, we propo…

    Submitted 19 March, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

  27. arXiv:2402.03509  [pdf, other]

    cs.CL cs.AI cs.LG

    Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains

    Authors: Sanjana Ramprasad, Kundan Krishna, Zachary C Lipton, Byron C Wallace

    Abstract: Recent work has shown that large language models (LLMs) are capable of generating summaries zero-shot (i.e., without explicit supervision) that, under human assessment, are often comparable or even preferred to manually composed reference summaries. However, this prior work has focussed almost exclusively on evaluating news article summarization. How do zero-shot summarizers perform in other (pote…

    Submitted 5 February, 2024; originally announced February 2024.

  28. arXiv:2402.01700  [pdf]

    cs.CL cs.AI

    Question answering systems for health professionals at the point of care -- a systematic review

    Authors: Gregory Kell, Angus Roberts, Serge Umansky, Linglong Qian, Davide Ferrari, Frank Soboczenski, Byron Wallace, Nikhil Patel, Iain J Marshall

    Abstract: Objective: Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence. However, QA systems have not been widely adopted. This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement. Materials and method…

    Submitted 24 January, 2024; originally announced February 2024.

    Comments: Accepted to the Journal of the American Medical Informatics Association (JAMIA)

  29. arXiv:2401.16934  [pdf, other]

    astro-ph.GA

    Extreme emission line galaxies detected in JADES JWST/NIRSpec I: inferred galaxy properties

    Authors: Kit Boyett, Andrew J. Bunker, Emma Curtis-Lake, Jacopo Chevallard, Alex J. Cameron, Gareth C. Jones, Aayush Saxena, Stéphane Charlot, Mirko Curti, Imaan E. B. Wallace, Santiago Arribas, Stefano Carniani, Chris Willott, Stacey Alberts, Daniel J. Eisenstein, Kevin Hainline, Ryan Hausen, Benjamin D. Johnson, Marcia Rieke, Brant Robertson, Daniel P. Stark, Sandro Tacchella, Christina C. Williams, Zuyi Chen, Eiichi Egami, et al. (11 additional authors not shown)

    Abstract: Extreme emission line galaxies (EELGs) exhibit large equivalent widths (EW) in their rest-optical emission lines ([OIII]$\lambda5007$ or H$α$ rest-frame EW$ > 750Å$) which can be tied to a recent upturn in star formation rate, due to the sensitivity of the nebular line emission and the rest-optical continuum to young ($<10$Myr) and evolved stellar populations, respectively. By studying a sample of…

    Submitted 23 October, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: 34 pages, 25 figures

  30. arXiv:2401.16475  [pdf, other]

    cs.CL

    InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification

    Authors: Jan Trienes, Sebastian Joseph, Jörg Schlötterer, Christin Seifert, Kyle Lo, Wei Xu, Byron C. Wallace, Junyi Jessy Li

    Abstract: Text simplification aims to make technical texts more accessible to laypeople but often results in deletion of information and vagueness. This work proposes InfoLossQA, a framework to characterize and recover simplification-induced information loss in form of question-and-answer (QA) pairs. Building on the theory of Question Under Discussion, the QA pairs are designed to help readers deepen their…

    Submitted 4 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted at ACL 2024 (main conference)

  31. arXiv:2311.13978  [pdf, other]

    cs.LG eess.IV

    MedISure: Towards Assuring Machine Learning-based Medical Image Classifiers using Mixup Boundary Analysis

    Authors: Adam Byfield, William Poulett, Ben Wallace, Anusha Jose, Shatakshi Tyagi, Smita Shembekar, Adnan Qayyum, Junaid Qadir, Muhammad Bilal

    Abstract: Machine learning (ML) models are becoming integral in healthcare technologies, presenting a critical need for formal assurance to validate their safety, fairness, robustness, and trustworthiness. These models are inherently prone to errors, potentially posing serious risks to patient health and could even cause irreparable harm. Traditional software assurance techniques rely on fixed code and do n…

    Submitted 23 November, 2023; originally announced November 2023.

  32. arXiv:2311.12908  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG

    Diffusion Model Alignment Using Direct Preference Optimization

    Authors: Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik

    Abstract: Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. In contrast to LLMs, human preference learning has not been widely explored in text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model using carefully curated high quality im…

    Submitted 21 November, 2023; originally announced November 2023.

  33. arXiv:2311.11211  [pdf]

    cs.AI

    Leveraging Generative AI for Clinical Evidence Summarization Needs to Ensure Trustworthiness

    Authors: Gongbo Zhang, Qiao Jin, Denis Jered McInerney, Yong Chen, Fei Wang, Curtis L. Cole, Qian Yang, Yanshan Wang, Bradley A. Malin, Mor Peleg, Byron C. Wallace, Zhiyong Lu, Chunhua Weng, Yifan Peng

    Abstract: Evidence-based medicine promises to improve the quality of healthcare by empowering medical decisions and practices with the best available evidence. The rapid growth of medical evidence, which can be obtained from various sources, poses a challenge in collecting, appraising, and synthesizing the evidential information. Recent advancements in generative AI, exemplified by large language models, ho…

    Submitted 31 March, 2024; v1 submitted 18 November, 2023; originally announced November 2023.

  34. Future Lens: Anticipating Subsequent Tokens from a Single Hidden State

    Authors: Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, David Bau

    Abstract: We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear appr…

    Submitted 8 November, 2023; originally announced November 2023.

    Comments: Accepted at CoNLL 2023

  35. arXiv:2310.15213  [pdf, other]

    cs.CL cs.LG

    Function Vectors in Large Language Models

    Authors: Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, David Bau

    Abstract: We report the presence of a simple neural mechanism that represents an input-output function as a vector within autoregressive transformer language models (LMs). Using causal mediation analysis on a diverse range of in-context-learning (ICL) tasks, we find that a small number of attention heads transport a compact representation of the demonstrated task, which we call a function vector (FV). FVs are…

    Submitted 25 February, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: ICLR 2024. 52 pages, 30 figures, 23 tables. Code and data at https://functions.baulab.info

  36. arXiv:2309.04550  [pdf, other]

    cs.CL

    Retrieving Evidence from EHRs with LLMs: Possibilities and Challenges

    Authors: Hiba Ahsan, Denis Jered McInerney, Jisoo Kim, Christopher Potter, Geoffrey Young, Silvio Amir, Byron C. Wallace

    Abstract: Unstructured data in Electronic Health Records (EHRs) often contains critical information -- complementary to imaging -- that could inform radiologists' diagnoses. But the large volume of notes often associated with patients together with time constraints renders manually identifying relevant evidence practically infeasible. In this work we propose and evaluate a zero-shot strategy for using LLMs…

    Submitted 10 June, 2024; v1 submitted 8 September, 2023; originally announced September 2023.

  37. arXiv:2307.16862  [pdf, other]

    eess.SY

    Modulation-Enhanced Excitation for Continuous-Time Reinforcement Learning via Symmetric Kronecker Products

    Authors: Brent A. Wallace, Jennie Si

    Abstract: This work introduces new results in continuous-time reinforcement learning (CT-RL) control of affine nonlinear systems to address a major algorithmic challenge due to a lack of persistence of excitation (PE). This PE design limitation has previously stifled CT-RL numerical performance and prevented these algorithms from achieving control synthesis goals. Our new theoretical developments in symmetr…

    Submitted 31 July, 2023; originally announced July 2023.

  38. arXiv:2307.08920  [pdf, other]

    eess.SY cs.AI cs.LG

    Continuous-Time Reinforcement Learning: New Design Algorithms with Theoretical Insights and Performance Guarantees

    Authors: Brent A. Wallace, Jennie Si

    Abstract: Continuous-time nonlinear optimal control problems hold great promise in real-world applications. After decades of development, reinforcement learning (RL) has achieved some of the greatest successes as a general nonlinear control design method. However, a recent comprehensive analysis of state-of-the-art continuous-time RL (CT-RL) methods, namely, adaptive dynamic programming (ADP)-based CT-RL al…

    Submitted 17 July, 2023; originally announced July 2023.

  39. arXiv:2306.11270  [pdf, other]

    cs.CL cs.LG

    Evaluating the Zero-shot Robustness of Instruction-tuned Language Models

    Authors: Jiuding Sun, Chantal Shaib, Byron C. Wallace

    Abstract: Instruction fine-tuning has recently emerged as a promising approach for improving the zero-shot capabilities of Large Language Models (LLMs) on new tasks. This technique has shown particular strength in improving the performance of modestly sized LLMs, sometimes inducing performance competitive with much larger model variants. In this paper we ask two questions: (1) How sensitive are instruction-…

    Submitted 8 July, 2023; v1 submitted 19 June, 2023; originally announced June 2023.

  40. JADES NIRSpec Initial Data Release for the Hubble Ultra Deep Field: Redshifts and Line Fluxes of Distant Galaxies from the Deepest JWST Cycle 1 NIRSpec Multi-Object Spectroscopy

    Authors: Andrew J. Bunker, Alex J. Cameron, Emma Curtis-Lake, Peter Jakobsen, Stefano Carniani, Mirko Curti, Joris Witstok, Roberto Maiolino, Francesco D'Eugenio, Tobias J. Looser, Chris Willott, Nina Bonaventura, Kevin Hainline, Hannah Uebler, Christopher N. A. Willmer, Aayush Saxena, Renske Smit, Stacey Alberts, Santiago Arribas, William M. Baker, Stefi Baum, Rachana Bhatawdekar, Rebecca A. A. Bowler, Kristan Boyett, Stephane Charlot , et al. (41 additional authors not shown)

    Abstract: We describe the NIRSpec component of the JWST Deep Extragalactic Survey (JADES), and provide deep spectroscopy of 253 sources targeted with the NIRSpec micro-shutter assembly in the Hubble Ultra Deep Field and surrounding GOODS-South. The multi-object spectra presented here are the deepest so far obtained with JWST, amounting to up to 28 hours in the low-dispersion ($R\sim 30-300$) prism, and up t…

    Submitted 31 May, 2024; v1 submitted 4 June, 2023; originally announced June 2023.

    Comments: Accepted for publication in A&A. Data products available from https://archive.stsci.edu/hlsp/jades

    Journal ref: A&A 690, A288 (2024)

  41. arXiv:2306.02466  [pdf, other]

    astro-ph.GA astro-ph.CO

    JADES Initial Data Release for the Hubble Ultra Deep Field: Revealing the Faint Infrared Sky with Deep JWST NIRCam Imaging

    Authors: Marcia J. Rieke, Brant E. Robertson, Sandro Tacchella, Kevin Hainline, Benjamin D. Johnson, Ryan Hausen, Zhiyuan Ji, Christopher N. A. Willmer, Daniel J. Eisenstein, Dàvid Puskàs, Stacey Alberts, Santiago Arribas, William M. Baker, Stefi Baum, Rachana Bhatawdekar, Nina Bonaventura, Kit Boyett, Andrew Bunker, Alex J. Cameron, Stefano Carniani, Stephane Charlot, Jacopo Chevallard, Zuyi Chen, Mirko Curti, Emma Curtis-Lake, et al. (34 additional authors not shown)

    Abstract: JWST has revolutionized the field of extragalactic astronomy with its sensitive and high-resolution infrared view of the distant universe. Adding to the new legacy of JWST observations, we present the first NIRCam imaging data release from the JWST Advanced Deep Extragalactic Survey (JADES) providing 9 filters of infrared imaging of $\sim$25 arcmin$^2$ covering the Hubble Ultra Deep Field and port…

    Submitted 1 September, 2023; v1 submitted 4 June, 2023; originally announced June 2023.

    Comments: Several figures were modified to use better line styles. A brief comparison to IRAC Channel 1 photometry was added along with a few other clarifications. Paper has been accepted for publication in ApJS

  42. arXiv:2306.02465  [pdf, other]

    astro-ph.GA

    Overview of the JWST Advanced Deep Extragalactic Survey (JADES)

    Authors: Daniel J. Eisenstein, Chris Willott, Stacey Alberts, Santiago Arribas, Nina Bonaventura, Andrew J. Bunker, Alex J. Cameron, Stefano Carniani, Stephane Charlot, Emma Curtis-Lake, Francesco D'Eugenio, Ryan Endsley, Pierre Ferruit, Giovanna Giardino, Kevin Hainline, Ryan Hausen, Peter Jakobsen, Benjamin D. Johnson, Roberto Maiolino, Marcia Rieke, George Rieke, Hans-Walter Rix, Brant Robertson, Daniel P. Stark, Sandro Tacchella , et al. (51 additional authors not shown)

    Abstract: We present an overview of the James Webb Space Telescope (JWST) Advanced Deep Extragalactic Survey (JADES), an ambitious program of infrared imaging and spectroscopy in the GOODS-S and GOODS-N deep fields, designed to study galaxy evolution from high redshift to cosmic noon. JADES uses about 770 hours of Cycle 1 guaranteed time largely from the Near-Infrared Camera (NIRCam) and Near-Infrared Spect…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: 33 pages, submitted to ApJ Supplement. The JADES Collaboration web site is at https://jades-survey.github.io, and the initial data release is available at https://archive.stsci.edu/hlsp/jades with a viewer at http://jades.idies.jhu.edu

  43. arXiv:2305.14296  [pdf, other]

    cs.CL cs.LG

    USB: A Unified Summarization Benchmark Across Tasks and Domains

    Authors: Kundan Krishna, Prakhar Gupta, Sanjana Ramprasad, Byron C. Wallace, Jeffrey P. Bigham, Zachary C. Lipton

    Abstract: While the NLP community has produced numerous summarization benchmarks, none provide the rich annotations required to simultaneously address many important problems related to control and reliability. We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks: (i) extractive summarization; (ii) abstractive summarization…

    Submitted 4 December, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: EMNLP Findings 2023 Camera Ready

  44. arXiv:2305.13693  [pdf, other]

    cs.CL

    Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations

    Authors: Lucy Lu Wang, Yulia Otmakhova, Jay DeYoung, Thinh Hung Truong, Bailey E. Kuehl, Erin Bransom, Byron C. Wallace

    Abstract: Evaluating multi-document summarization (MDS) quality is difficult. This is especially true in the case of MDS for biomedical literature reviews, where models must synthesize contradicting evidence reported across different documents. Prior work has shown that rather than performing the task, models may exploit shortcuts that are difficult to detect using standard n-gram similarity metrics such as…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: ACL 2023; Github: https://github.com/allenai/mslr-annotated-dataset

  45. arXiv:2305.12532  [pdf, other]

    cs.CL

    Multilingual Simplification of Medical Texts

    Authors: Sebastian Joseph, Kathryn Kazanas, Keziah Reina, Vishnesh J. Ramanathan, Wei Xu, Byron C. Wallace, Junyi Jessy Li

    Abstract: Automated text simplification aims to produce simple versions of complex texts. This task is especially useful in the medical domain, where the latest medical findings are typically communicated via complex and technical articles. This creates barriers for laypeople seeking access to up-to-date medical findings, consequently impeding progress on health literacy. Most existing work on medical text…

    Submitted 18 October, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: This version will be in EMNLP 2023 main

  46. arXiv:2305.11828  [pdf, other]

    cs.CL cs.AI cs.HC

    Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews

    Authors: Hye Sun Yun, Iain J. Marshall, Thomas A. Trikalinos, Byron C. Wallace

    Abstract: Medical systematic reviews play a vital role in healthcare decision making and policy. However, their production is time-consuming, limiting the availability of high-quality and up-to-date evidence summaries. Recent advancements in large language models (LLMs) offer the potential to automatically generate literature reviews on demand, addressing this issue. However, LLMs sometimes generate inaccur…

    Submitted 18 October, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: 18 pages, 2 figures, 8 tables. Accepted as an EMNLP 2023 main paper

  47. arXiv:2305.06299  [pdf, other]

    cs.CL

    Summarizing, Simplifying, and Synthesizing Medical Evidence Using GPT-3 (with Varying Success)

    Authors: Chantal Shaib, Millicent L. Li, Sebastian Joseph, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace

    Abstract: Large language models, particularly GPT-3, are able to produce high quality summaries of general domain news articles in few- and zero-shot settings. However, it is unclear if such models are similarly capable in more specialized, high-stakes domains such as biomedicine. In this paper, we enlist domain experts (individuals with medical training) to evaluate summaries of biomedical articles generat…

    Submitted 11 May, 2023; v1 submitted 10 May, 2023; originally announced May 2023.

    Comments: Accepted short paper to ACL 2023

  48. arXiv:2305.05003  [pdf, other]

    cs.CL

    Revisiting Relation Extraction in the era of Large Language Models

    Authors: Somin Wadhwa, Silvio Amir, Byron C. Wallace

    Abstract: Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text. Standard supervised RE techniques entail training modules to tag tokens comprising entity spans and then predict the relationship between them. Recent work has instead treated the problem as a \emph{sequence-to-sequence} task, linearizing relations between entities as target strings to be…

    Submitted 16 July, 2024; v1 submitted 8 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023

  49. arXiv:2305.03642  [pdf, other]

    cs.CL

    Jointly Extracting Interventions, Outcomes, and Findings from RCT Reports with LLMs

    Authors: Somin Wadhwa, Jay DeYoung, Benjamin Nye, Silvio Amir, Byron C. Wallace

    Abstract: Results from Randomized Controlled Trials (RCTs) establish the comparative effectiveness of interventions, and are in turn critical inputs for evidence-based care. However, results from RCTs are presented in (often unstructured) natural language articles describing the design, execution, and outcomes of trials; clinicians must manually extract findings pertaining to interventions and outcomes of i…

    Submitted 17 July, 2023; v1 submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted to MLHC 2023

  50. arXiv:2303.13703  [pdf, other]

    cs.CV cs.AI cs.LG

    End-to-End Diffusion Latent Optimization Improves Classifier Guidance

    Authors: Bram Wallace, Akash Gokul, Stefano Ermon, Nikhil Naik

    Abstract: Classifier guidance -- using the gradients of an image classifier to steer the generations of a diffusion model -- has the potential to dramatically expand the creative control over image generation and editing. However, currently classifier guidance requires either training new noise-aware models to obtain accurate gradients or using a one-step denoising approximation of the final generation, whi…

    Submitted 31 May, 2023; v1 submitted 23 March, 2023; originally announced March 2023.