Showing 1–50 of 120 results for author: Cohen, B

  1. arXiv:2506.11244  [pdf, ps, other]

    cs.CL

    Iterative Multilingual Spectral Attribute Erasure

    Authors: Shun Shao, Yftah Ziser, Zheng Zhao, Yifu Qiu, Shay B. Cohen, Anna Korhonen

    Abstract: Multilingual representations embed words with similar meanings to share a common semantic space across languages, creating opportunities to transfer debiasing effects between languages. However, existing methods for debiasing are unable to exploit this opportunity because they operate on individual languages. We present Iterative Multilingual Spectral Attribute Erasure (IMSAE), which identifies an…

    Submitted 12 June, 2025; originally announced June 2025.

    Comments: 8 pages, 3 figures

  2. arXiv:2506.09902  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants

    Authors: Zheng Zhao, Clara Vania, Subhradeep Kayal, Naila Khan, Shay B. Cohen, Emine Yilmaz

    Abstract: Large language models (LLMs) have advanced conversational AI assistants. However, systematically evaluating how well these assistants apply personalization--adapting to individual user preferences while completing tasks--remains challenging. Existing personalization benchmarks focus on chit-chat, non-conversational tasks, or narrow domains, failing to capture the complexities of personalized task-…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: Accepted to ACL 2025 Findings

  3. arXiv:2506.08231  [pdf]

    cs.LG cs.AI cs.PF

    Ensuring Reliability of Curated EHR-Derived Data: The Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework

    Authors: Melissa Estevez, Nisha Singh, Lauren Dyson, Blythe Adamson, Qianyu Yuan, Megan W. Hildner, Erin Fidyk, Olive Mbah, Farhad Khan, Kathi Seidl-Rathkopf, Aaron B. Cohen

    Abstract: Large language models (LLMs) are increasingly used to extract clinical data from electronic health records (EHRs), offering significant improvements in scalability and efficiency for real-world data (RWD) curation in oncology. However, the adoption of LLMs introduces new challenges in ensuring the reliability, accuracy, and fairness of extracted data, which are essential for research, regulatory,…

    Submitted 9 June, 2025; originally announced June 2025.

    Comments: 18 pages, 3 tables, 1 figure

  4. arXiv:2506.06006  [pdf, ps, other]

    cs.CV cs.AI cs.CL

    Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models

    Authors: Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

    Abstract: To what extent do vision-and-language foundation models possess a realistic world model (observation $\times$ action $\rightarrow$ observation) and a dynamics model (observation $\times$ observation $\rightarrow$ action), when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision…

    Submitted 6 June, 2025; originally announced June 2025.

  5. arXiv:2505.17801  [pdf, ps, other]

    cs.AI

    Integrating Counterfactual Simulations with Language Models for Explaining Multi-Agent Behaviour

    Authors: Bálint Gyevnár, Christopher G. Lucas, Stefano V. Albrecht, Shay B. Cohen

    Abstract: Autonomous multi-agent systems (MAS) are useful for automating complex tasks but raise trust concerns due to risks like miscoordination and goal misalignment. Explainability is vital for trust calibration, but explainable reinforcement learning for MAS faces challenges in state/action space complexity, stakeholder needs, and evaluation. Using the counterfactual theory of causation and LLMs' summar…

    Submitted 23 May, 2025; originally announced May 2025.

  6. arXiv:2505.14766  [pdf, ps, other]

    cs.LG cs.AI

    This Time is Different: An Observability Perspective on Time Series Foundation Models

    Authors: Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, Zongzhe Xu, Viktoriya Zhukova, David Asker, Ameet Talwalkar, Othmane Abou-Amal

    Abstract: We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto's pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10$\times$ larger th…

    Submitted 20 May, 2025; originally announced May 2025.

  7. arXiv:2505.05291  [pdf, other]

    eess.IV cs.AI cs.CV q-bio.TO

    Benchmarking Ophthalmology Foundation Models for Clinically Significant Age Macular Degeneration Detection

    Authors: Benjamin A. Cohen, Jonathan Fhima, Meishar Meisel, Baskin Meital, Luis Filipe Nakayama, Eran Berkowitz, Joachim A. Behar

    Abstract: Self-supervised learning (SSL) has enabled Vision Transformers (ViTs) to learn robust representations from large-scale natural image datasets, enhancing their generalization across domains. In retinal imaging, foundation models pretrained on either natural or ophthalmic data have shown promise, but the benefits of in-domain pretraining remain uncertain. To investigate this, we benchmark six SSL-pr…

    Submitted 22 May, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

    Comments: 10 pages, 3 figures

  8. arXiv:2504.12971  [pdf, other]

    cs.LG cs.AI

    Transferrable Surrogates in Expressive Neural Architecture Search Spaces

    Authors: Shiwen Qin, Gabriela Kadlecová, Martin Pilát, Shay B. Cohen, Roman Neruda, Elliot J. Crowley, Jovita Lukasik, Linus Ericsson

    Abstract: Neural architecture search (NAS) faces a challenge in balancing the exploration of expressive, broad search spaces that enable architectural innovation with the need for efficient evaluation of architectures to effectively search such spaces. We investigate surrogate model training for improving search in highly expressive NAS search spaces based on context-free grammars. We show that i) surrogate…

    Submitted 18 April, 2025; v1 submitted 17 April, 2025; originally announced April 2025.

    Comments: Project page at: https://shiwenqin.github.io/TransferrableSurrogate/

  9. arXiv:2504.12494  [pdf]

    cs.CL

    Accelerating Clinical NLP at Scale with a Hybrid Framework with Reduced GPU Demands: A Case Study in Dementia Identification

    Authors: Jianlin Shi, Qiwei Gan, Elizabeth Hanchrow, Annie Bowles, John Stanley, Adam P. Bress, Jordana B. Cohen, Patrick R. Alba

    Abstract: Clinical natural language processing (NLP) is increasingly in demand in both clinical research and operational practice. However, most of the state-of-the-art solutions are transformers-based and require high computational resources, limiting their accessibility. We propose a hybrid NLP framework that integrates rule-based filtering, a Support Vector Machine (SVM) classifier, and a BERT-based mode…

    Submitted 16 April, 2025; originally announced April 2025.

    Comments: This manuscript has been submitted to AMIA 2025 annual symposium (https://amia.org/education-events/amia-2025-annual-symposium)

  10. arXiv:2502.13137  [pdf, other]

    cs.AI

    Theorem Prover as a Judge for Synthetic Data Generation

    Authors: Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen

    Abstract: The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge, affecting data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathema…

    Submitted 18 February, 2025; originally announced February 2025.

  11. A Family-Based Approach to Safety Cases for Controlled Airspaces in Small Uncrewed Aerial Systems

    Authors: Michael C. Hunter, Usman Gohar, Myra B. Cohen, Robyn R. Lutz, Jane Cleland-Huang

    Abstract: As small Uncrewed Aircraft Systems (sUAS) increasingly operate in the national airspace, safety concerns arise due to a corresponding rise in reported airspace violations and incidents, highlighting the need for a safe mechanism for sUAS entry control to manage the potential overload. This paper presents work toward our aim of establishing automated, customized safety-claim support for managing on…

    Submitted 4 February, 2025; originally announced February 2025.

    Comments: Accepted at AIAA 2024

  12. arXiv:2502.00238  [pdf, other]

    cs.SE

    A Taxonomy of Real-World Defeaters in Safety Assurance Cases

    Authors: Usman Gohar, Michael C. Hunter, Myra B. Cohen, Robyn R. Lutz

    Abstract: The rise of cyber-physical systems in safety-critical domains calls for robust risk-evaluation frameworks. Assurance cases, often required by regulatory bodies, are a structured approach to demonstrate that a system meets its safety requirements. However, assurance cases are fraught with challenges, such as incomplete evidence and gaps in reasoning, called defeaters, that can call into question th…

    Submitted 31 January, 2025; originally announced February 2025.

    Comments: ICSE 2025, Workshop on Multi-disciplinary, Open, and integRatEd Requirements Engineering

  13. arXiv:2501.08248  [pdf, ps, other]

    cs.CL cs.AI cs.IR cs.LG

    Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models

    Authors: Yifu Qiu, Varun Embar, Yizhe Zhang, Navdeep Jaitly, Shay B. Cohen, Benjamin Han

    Abstract: Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly -- a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LC…

    Submitted 9 June, 2025; v1 submitted 14 January, 2025; originally announced January 2025.

  14. arXiv:2411.15068  [pdf, other]

    cs.CL

    Locating the Leading Edge of Cultural Change

    Authors: Sarah Griebel, Becca Cohen, Lucian Li, Jaihyun Park, Jiayu Liu, Jana Perkins, Ted Underwood

    Abstract: Measures of textual similarity and divergence are increasingly used to study cultural change. But which measures align, in practice, with social evidence about change? We apply three different representations of text (topic models, document embeddings, and word-level perplexity) to three different corpora (literary studies, economics, and fiction). In every case, works by highly-cited authors and…

    Submitted 22 November, 2024; originally announced November 2024.

    Comments: Accepted CHR 2024

  15. TSPRank: Bridging Pairwise and Listwise Methods with a Bilinear Travelling Salesman Model

    Authors: Weixian Waylon Li, Yftah Ziser, Yifei Xie, Shay B. Cohen, Tiejun Ma

    Abstract: Traditional Learning-To-Rank (LETOR) approaches, including pairwise methods like RankNet and LambdaMART, often fall short by solely focusing on pairwise comparisons, leading to sub-optimal global rankings. Conversely, deep learning based listwise methods, while aiming to optimise entire lists, require complex tuning and yield only marginal improvements over robust pairwise models. To overcome thes…

    Submitted 23 March, 2025; v1 submitted 18 November, 2024; originally announced November 2024.

    Comments: Accepted to ACM SIGKDD 2025 Research Track. The code and preprocessed data are available at https://github.com/waylonli/TSPRank-KDD2025

  16. arXiv:2410.20008  [pdf, other]

    cs.CL cs.LG

    Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models

    Authors: Zheng Zhao, Yftah Ziser, Shay B. Cohen

    Abstract: Fine-tuning pre-trained large language models (LLMs) on a diverse array of tasks has become a common approach for building models that can solve various natural language processing (NLP) tasks. However, where and to what extent these models retain task-specific knowledge remains largely unexplored. This study investigates the task-specific information encoded in pre-trained LLMs and the effects of…

    Submitted 25 October, 2024; originally announced October 2024.

    Comments: Accepted to EMNLP 2024

  17. arXiv:2410.10614  [pdf, other]

    cs.CE cs.AI cs.CL q-fin.CP

    Modeling News Interactions and Influence for Financial Market Prediction

    Authors: Mengyu Wang, Shay B. Cohen, Tiejun Ma

    Abstract: The diffusion of financial news into market prices is a complex process, making it challenging to evaluate the connections between news events and market movements. This paper introduces FININ (Financial Interconnected News Influence Network), a novel market prediction model that captures not only the links between news and prices but also the interactions among news items themselves. FININ effect…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: Accepted by EMNLP 2024

  18. arXiv:2410.10336  [pdf, other]

    cs.AI cs.CL cs.LG cs.SC

    CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning

    Authors: Joshua Ong Jun Leang, Aryo Pradipta Gema, Shay B. Cohen

    Abstract: Mathematical reasoning remains a significant challenge for large language models (LLMs), despite progress in prompting techniques such as Chain-of-Thought (CoT). We present Chain of Mathematically Annotated Thought (CoMAT), which enhances reasoning through two stages: Symbolic Conversion (converting natural language queries into symbolic form) and Reasoning Execution (deriving answers from symboli…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: 8 pages, 12 figures

  19. arXiv:2410.08811  [pdf, ps, other]

    cs.CR cs.AI cs.CL

    PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

    Authors: Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, Fazl Barez

    Abstract: Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content o…

    Submitted 6 June, 2025; v1 submitted 11 October, 2024; originally announced October 2024.

    Comments: Accepted at ICML 2025. Tingchen Fu and Fazl Barez are core research contributors

  20. arXiv:2408.11081  [pdf, other]

    cs.SE cs.AI cs.CL cs.LG

    What can Large Language Models Capture about Code Functional Equivalence?

    Authors: Nickil Maveli, Antonio Vergari, Shay B. Cohen

    Abstract: Code-LLMs, LLMs pre-trained on large code corpora, have shown great progress in learning rich representations of the structure and syntax of code, successfully using it to generate or classify code fragments. At the same time, understanding if they are able to do so because they capture code semantics, and how well, is still an open question. In this paper, we tackle this problem by introducing Se…

    Submitted 12 February, 2025; v1 submitted 20 August, 2024; originally announced August 2024.

    Comments: Accepted to Findings of NAACL 2025

  21. arXiv:2408.00913  [pdf, ps, other]

    cs.NI cs.ET

    Design and Implementation of ARA Wireless Living Lab for Rural Broadband and Applications

    Authors: Taimoor Ul Islam, Joshua Ofori Boateng, Md Nadim, Guoying Zu, Mukaram Shahid, Xun Li, Tianyi Zhang, Salil Reddy, Wei Xu, Ataberk Atalar, Vincent Lee, Yung-Fu Chen, Evan Gosling, Elisabeth Permatasari, Christ Somiah, Owen Perrin, Zhibo Meng, Reshal Afzal, Sarath Babu, Mohammed Soliman, Ali Hussain, Daji Qiao, Mai Zheng, Ozdal Boyraz, Yong Guan, et al. (9 additional authors not shown)

    Abstract: Addressing the broadband gap between rural and urban regions requires rural-focused wireless research and innovation. In the meantime, rural regions provide rich, diverse use cases of advanced wireless, and they offer unique real-world settings for piloting applications that advance the frontiers of wireless systems (e.g., teleoperation of ground and aerial vehicles). To fill the broadband gap and…

    Submitted 28 May, 2025; v1 submitted 1 August, 2024; originally announced August 2024.

    Comments: 47 pages, 18 figures

  22. arXiv:2407.13717  [pdf, other]

    cs.SE cs.AI

    CoDefeater: Using LLMs To Find Defeaters in Assurance Cases

    Authors: Usman Gohar, Michael C. Hunter, Robyn R. Lutz, Myra B. Cohen

    Abstract: Constructing assurance cases is a widely used, and sometimes required, process toward demonstrating that safety-critical systems will operate safely in their planned environment. To mitigate the risk of errors and missing edge cases, the concept of defeaters - arguments or evidence that challenge claims in an assurance case - has been introduced. Defeaters can provide timely detection of weaknesse…

    Submitted 16 August, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: ASE 2024 NIER

  23. arXiv:2407.07874  [pdf, ps, other]

    cs.LG cs.AI

    Toto: Time Series Optimized Transformer for Observability

    Authors: Ben Cohen, Emaad Khwaja, Kan Wang, Charles Masson, Elise Ramé, Youssef Doubli, Othmane Abou-Amal

    Abstract: This technical report describes the Time Series Optimized Transformer for Observability (Toto), a new state of the art foundation model for time series forecasting developed by Datadog. In addition to advancing the state of the art on generalized time series benchmarks in domains such as electricity and weather, this model is the first general-purpose time series forecasting foundation model to be…

    Submitted 11 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

  24. arXiv:2407.03277  [pdf, other]

    cs.CL

    Evaluating Automatic Metrics with Incremental Machine Translation Systems

    Authors: Guojun Wu, Shay B. Cohen, Rico Sennrich

    Abstract: We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions. Since human A/B testing is commonly used, we assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations. Our study not only confirms several prior findings, such as…

    Submitted 3 October, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

  25. arXiv:2405.20838  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    einspace: Searching for Neural Architectures from Fundamental Operations

    Authors: Linus Ericsson, Miguel Espinosa, Chenhongyi Yang, Antreas Antoniou, Amos Storkey, Shay B. Cohen, Steven McDonagh, Elliot J. Crowley

    Abstract: Neural architecture search (NAS) finds high performing networks for a given task. Yet the results of NAS are fairly prosaic; they did not e.g. create a shift from convolutional structures to transformers. This is not least because the search spaces in NAS often aren't diverse enough to include such transformations a priori. Instead, for NAS to provide greater potential for fundamental design shift…

    Submitted 30 October, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

    Comments: NeurIPS 2024. Project page at https://linusericsson.github.io/einspace/

  26. arXiv:2405.09719  [pdf, other]

    cs.CL cs.AI cs.LG

    Spectral Editing of Activations for Large Language Model Alignment

    Authors: Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen

    Abstract: Large language models (LLMs) often exhibit undesirable behaviours, such as generating untruthful or biased content. Editing their internal representations has been shown to be effective in mitigating such behaviours on top of the existing alignment methods. We propose a novel inference-time editing method, namely spectral editing of activations (SEA), to project the input representations into dire…

    Submitted 3 November, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

    Comments: 24 pages, NeurIPS 2024

  27. arXiv:2403.13312  [pdf, other]

    cs.CL

    LeanReasoner: Boosting Complex Logical Reasoning with Lean

    Authors: Dongwei Jiang, Marcio Fonseca, Shay B. Cohen

    Abstract: Large language models (LLMs) often struggle with complex logical reasoning due to logical inconsistencies and the inherent difficulty of such reasoning. We use Lean, a theorem proving framework, to address these challenges. By formalizing logical reasoning problems into theorems within Lean, we can solve them by proving or disproving the corresponding theorems. This method reduces the risk of logi…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Accepted to NAACL 2024 main conference

  28. arXiv:2403.08828  [pdf, other]

    cs.HC cs.AI cs.RO

    People Attribute Purpose to Autonomous Vehicles When Explaining Their Behavior: Insights from Cognitive Science for Explainable AI

    Authors: Balint Gyevnar, Stephanie Droop, Tadeg Quillien, Shay B. Cohen, Neil R. Bramley, Christopher G. Lucas, Stefano V. Albrecht

    Abstract: It is often argued that effective human-centered explainable artificial intelligence (XAI) should resemble human reasoning. However, empirical investigations of how concepts from cognitive science can aid the design of XAI are lacking. Based on insights from cognitive science, we propose a framework of explanatory modes to analyze how people frame explanations, whether mechanistic, teleological, o…

    Submitted 3 February, 2025; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: CHI 2025

  29. arXiv:2402.15055  [pdf, other]

    cs.CL cs.AI cs.LG

    Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

    Authors: Clement Neo, Shay B. Cohen, Fazl Barez

    Abstract: Understanding the inner workings of large language models (LLMs) is crucial for advancing their theoretical foundations and real-world applications. While the attention mechanism and multi-layer perceptrons (MLPs) have been studied independently, their interactions remain largely unexplored. This study investigates how attention heads and next-token neurons interact in LLMs to predict new words. W…

    Submitted 23 October, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: Accepted to EMNLP 2024 Main Conference

  30. arXiv:2402.10643  [pdf, other]

    cs.CL cs.AI

    `Keep it Together': Enforcing Cohesion in Extractive Summaries by Simulating Human Memory

    Authors: Ronald Cardenas, Matthias Galle, Shay B. Cohen

    Abstract: Extractive summaries are usually presented as lists of sentences with no expected cohesion between them. In this paper, we aim to enforce cohesion whilst controlling for informativeness and redundancy in summaries, in cases where the input exhibits high redundancy. The pipeline controls for redundancy in long inputs as it is consumed, and balances informativeness and cohesion during sentence selec…

    Submitted 16 February, 2024; originally announced February 2024.

  31. arXiv:2401.10415  [pdf, other]

    cs.CL cs.AI

    Can Large Language Model Summarizers Adapt to Diverse Scientific Communication Goals?

    Authors: Marcio Fonseca, Shay B. Cohen

    Abstract: In this work, we investigate the controllability of large language models (LLMs) on scientific summarization tasks. We identify key stylistic and content coverage factors that characterize different types of summaries such as paper reviews, abstracts, and lay summaries. By controlling stylistic features, we find that non-fine-tuned LLMs outperform humans in the MuP review generation task, both in…

    Submitted 27 June, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: ACL 2024 camera ready

  32. arXiv:2401.07353  [pdf, other]

    cs.SE cs.AI cs.LG

    Towards Engineering Fair and Equitable Software Systems for Managing Low-Altitude Airspace Authorizations

    Authors: Usman Gohar, Michael C. Hunter, Agnieszka Marczak-Czajka, Robyn R. Lutz, Myra B. Cohen, Jane Cleland-Huang

    Abstract: Small Unmanned Aircraft Systems (sUAS) have gained widespread adoption across a diverse range of applications. This has introduced operational complexities within shared airspaces and an increase in reported incidents, raising safety concerns. In response, the U.S. Federal Aviation Administration (FAA) is developing a UAS Traffic Management (UTM) system to control access to airspace based on an sU…

    Submitted 3 February, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

    Journal ref: ICSE-SEIS 2024

  33. arXiv:2401.01814  [pdf, other]

    cs.AI

    Large Language Models Relearn Removed Concepts

    Authors: Michelle Lo, Shay B. Cohen, Fazl Barez

    Abstract: Advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. However, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. To investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining. Our findings reveal that mod…

    Submitted 3 January, 2024; originally announced January 2024.

  34. arXiv:2312.03480  [pdf, other]

    cs.CL

    AMR Parsing is Far from Solved: GrAPES, the Granular AMR Parsing Evaluation Suite

    Authors: Jonas Groschwitz, Shay B. Cohen, Lucia Donatelli, Meaghan Fowlie

    Abstract: We present the Granular AMR Parsing Evaluation Suite (GrAPES), a challenge set for Abstract Meaning Representation (AMR) parsing with accompanying evaluation metrics. AMR parsers now obtain high scores on the standard AMR evaluation metric Smatch, close to or even above reported inter-annotator agreement. But that does not mean that AMR parsing is solved; in fact, human evaluation in previous work…

    Submitted 6 December, 2023; originally announced December 2023.

    Comments: Accepted at EMNLP 2023. For the associated GitHub repository, see https://github.com/jgroschwitz/GrAPES

    ACM Class: J.5

  35. arXiv:2311.09467  [pdf, other]

    cs.CL cs.AI

    Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation

    Authors: Yifu Qiu, Varun Embar, Shay B. Cohen, Benjamin Han

    Abstract: Knowledge-to-text generators often struggle to faithfully generate descriptions for the input facts: they may produce hallucinations that contradict the input, or describe facts not present in the input. To reduce hallucinations, we propose a decoding-only method, TWEAK (Think While Effectively Articulating Knowledge), which can be integrated with any generator without retraining. TWEAK treats the…

    Submitted 3 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: NAACL 2024 (Findings)

  36. arXiv:2311.08704  [pdf, other]

    cs.CL cs.AI

    Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains

    Authors: Marcio Fonseca, Shay B. Cohen

    Abstract: Although large language models (LLMs) exhibit remarkable capacity to leverage in-context demonstrations, it is still unclear to what extent they can learn new concepts or facts from ground-truth labels. To address this question, we examine the capacity of instruction-tuned LLMs to follow in-context concept guidelines for sentence labeling tasks. We design guidelines that present different types of…

    Submitted 26 June, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

    Comments: ACL 2024 camera ready

  37. arXiv:2311.08398  [pdf, other]

    cs.CL cs.AI

    Are Large Language Models Temporally Grounded?

    Authors: Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen

    Abstract: Are Large language models (LLMs) temporally grounded? Since LLMs cannot perceive and interact with the environment, it is impossible to answer this question directly. Instead, we provide LLMs with textual narratives and probe them with respect to their common-sense knowledge of the structure and duration of events, their ability to order events along a timeline, and self-consistency within their t…

    Submitted 16 November, 2023; v1 submitted 14 November, 2023; originally announced November 2023.

  38. arXiv:2310.15513  [pdf, other]

    cs.CL

    A Joint Matrix Factorization Analysis of Multilingual Representations

    Authors: Zheng Zhao, Yftah Ziser, Bonnie Webber, Shay B. Cohen

    Abstract: We present an analysis tool based on joint matrix factorization for comparing latent representations of multilingual and monolingual models. An alternative to probing, this tool allows us to analyze multiple sets of representations in a joint manner. Using this tool, we study to what extent and how morphosyntactic features are reflected in the representations learned by multilingual pre-trained mo…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Accepted to Findings of EMNLP 2023

  39. HIFuzz: Human Interaction Fuzzing for small Unmanned Aerial Vehicles

    Authors: Theodore Chambers, Michael Vierhauser, Ankit Agrawal, Michael Murphy, Jason Matthew Brauer, Salil Purandare, Myra B. Cohen, Jane Cleland-Huang

    Abstract: Small Unmanned Aerial Systems (sUAS) must meet rigorous safety standards when deployed in high-stress emergency response scenarios; however many reported accidents have involved humans in the loop. In this paper, we, therefore, present the HiFuzz testing framework, which uses fuzz testing to identify system vulnerabilities associated with human interactions. HiFuzz includes three distinct levels t…

    Submitted 7 April, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

  40. arXiv:2305.19734  [pdf, other]

    cs.AI cs.CL cs.DB

    Knowledge Base Question Answering for Space Debris Queries

    Authors: Paul Darm, Antonio Valerio Miceli-Barone, Shay B. Cohen, Annalisa Riccardi

    Abstract: Space agencies execute complex satellite operations that need to be supported by the technical knowledge contained in their extensive information systems. Knowledge bases (KB) are an effective way of storing and accessing such information at scale. In this work we present a system, developed for the European Space Agency (ESA), that can answer complex natural language queries, to support engineers…

    Submitted 31 May, 2023; originally announced May 2023.

    Comments: 7 pages, ACL 2023 industry track

    ACM Class: I.2.7

  41. arXiv:2305.16947  [pdf, other]

    cs.CL

    Sentence-Incremental Neural Coreference Resolution

    Authors: Matt Grenander, Shay B. Cohen, Mark Steedman

    Abstract: We propose a sentence-incremental neural coreference resolution system which incrementally builds clusters after marking mention boundaries in a shift-reduce method. The system is aimed at bridging two recent approaches at coreference resolution: (1) state-of-the-art non-incremental models that incur quadratic complexity in document length with high computational cost, and (2) memory network-based…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted at EMNLP 2022

  42. arXiv:2305.15507  [pdf, other]

    cs.CL cs.AI

    The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

    Authors: Antonio Valerio Miceli-Barone, Fazl Barez, Ioannis Konstas, Shay B. Cohen

    Abstract: Large Language Models (LLMs) have successfully been applied to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs not only fail to prop…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 17 pages, 5 figures, ACL 2023

  43. arXiv:2305.13632  [pdf, other]

    cs.CL cs.AI cs.LG

    Detecting and Mitigating Hallucinations in Multilingual Summarisation

    Authors: Yifu Qiu, Yftah Ziser, Anna Korhonen, Edoardo M. Ponti, Shay B. Cohen

    Abstract: Hallucinations pose a significant challenge to the reliability of neural models for abstractive summarisation. While automatically generated summaries may be fluent, they often lack faithfulness to the original document. This issue becomes even more pronounced in low-resource settings, such as cross-lingual transfer. With the existing faithful metrics focusing on English, even measuring the extent…

    Submitted 26 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  44. arXiv:2305.08828  [pdf, other]

    cs.CL

    PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

    Authors: Ashok Urlana, Pinzhen Chen, Zheng Zhao, Shay B. Cohen, Manish Shrivastava, Barry Haddow

    Abstract: This paper introduces PMIndiaSum, a multilingual and massively parallel summarization corpus focused on languages in India. Our corpus provides a training and testing ground for four language families, 14 languages, and the largest to date with 196 language pairs. We detail our construction workflow including data acquisition, processing, and quality assurance. Furthermore, we publish benchmarks f…

    Submitted 19 October, 2023; v1 submitted 15 May, 2023; originally announced May 2023.

    Comments: Findings of EMNLP 2023

    ACM Class: I.2.7

  45. arXiv:2302.10809  [pdf, other]

    cs.AI cs.RO

    Causal Explanations for Sequential Decision-Making in Multi-Agent Systems

    Authors: Balint Gyevnar, Cheng Wang, Christopher G. Lucas, Shay B. Cohen, Stefano V. Albrecht

    Abstract: We present CEMA: Causal Explanations in Multi-Agent systems; a framework for creating causal natural language explanations of an agent's decisions in dynamic sequential multi-agent systems to build more trustworthy autonomous agents. Unlike prior work that assumes a fixed causal structure, CEMA only requires a probabilistic model for forward-simulating the state of the system. Using such a model,…

    Submitted 14 February, 2024; v1 submitted 21 February, 2023; originally announced February 2023.

    Comments: Accepted in 23rd International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS), 2024

    ACM Class: I.2.9

  46. arXiv:2302.09350  [pdf, other]

    cs.CL

    BERT is not The Count: Learning to Match Mathematical Statements with Proofs

    Authors: Weixian Waylon Li, Yftah Ziser, Maximin Coavoux, Shay B. Cohen

    Abstract: We introduce a task consisting in matching a proof to a given mathematical statement. The task fits well within current research on Mathematical Information Retrieval and, more generally, mathematical article analysis (Mathematical Sciences, 2014). We present a dataset for the task (the MATcH dataset) consisting of over 180k statement-proof pairs extracted from modern mathematical research article…

    Submitted 18 February, 2023; originally announced February 2023.

    Comments: Accepted to the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2023; 14 pages. arXiv admin note: substantial text overlap with arXiv:2102.02110

  47. arXiv:2211.13807  [pdf, other]

    cs.CV

    GEFF: Improving Any Clothes-Changing Person ReID Model using Gallery Enrichment with Face Features

    Authors: Daniel Arkushin, Bar Cohen, Shmuel Peleg, Ohad Fried

    Abstract: In the Clothes-Changing Re-Identification (CC-ReID) problem, given a query sample of a person, the goal is to determine the correct identity based on a labeled gallery in which the person appears in different clothes. Several models tackle this challenge by extracting clothes-independent features. However, the performance of these models is still lower for the clothes-changing setting compared to…

    Submitted 21 November, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

  48. arXiv:2211.09458  [pdf, other]

    cs.CL

    Abstractive Summarization Guided by Latent Hierarchical Document Structure

    Authors: Yifu Qiu, Shay B. Cohen

    Abstract: Sequential abstractive neural summarizers often do not use the underlying structure in the input article or dependencies between the input sentences. This structure is essential to integrate and consolidate information from different parts of the text. To address this shortcoming, we propose a hierarchy-aware graph neural network (HierGNN) which captures such dependencies through three main steps:…

    Submitted 17 November, 2022; originally announced November 2022.

    Comments: EMNLP 2022, 15 pages

  49. arXiv:2210.12553  [pdf, other]

    cs.CL cs.LG

    Understanding Domain Learning in Language Models Through Subpopulation Analysis

    Authors: Zheng Zhao, Yftah Ziser, Shay B. Cohen

    Abstract: We investigate how different domains are encoded in modern neural network architectures. We analyze the relationship between natural language domains, model size, and the amount of training data used. The primary analysis tool we develop is based on subpopulation analysis with Singular Vector Canonical Correlation Analysis (SVCCA), which we apply to Transformer-based language models (LMs). We comp…

    Submitted 22 October, 2022; originally announced October 2022.

    Comments: Accepted to BlackboxNLP 2022

  50. A Human-Centric Method for Generating Causal Explanations in Natural Language for Autonomous Vehicle Motion Planning

    Authors: Balint Gyevnar, Massimiliano Tamborski, Cheng Wang, Christopher G. Lucas, Shay B. Cohen, Stefano V. Albrecht

    Abstract: Inscrutable AI systems are difficult to trust, especially if they operate in safety-critical settings like autonomous driving. Therefore, there is a need to build transparent and queryable systems to increase trust levels. We propose a transparent, human-centric explanation generation method for autonomous vehicle motion planning and prediction based on an existing white-box system called IGP2. Ou…

    Submitted 27 June, 2022; v1 submitted 17 June, 2022; originally announced June 2022.

    Comments: IJCAI Workshop on Artificial Intelligence for Autonomous Driving (AI4AD), 2022