Showing 1–21 of 21 results for author: Benajiba, Y

  1. arXiv:2504.19457  [pdf, other]

    cs.CL cs.AI

    Towards Long Context Hallucination Detection

    Authors: Siyi Liu, Kishaloy Halder, Zheng Qi, Wei Xiao, Nikolaos Pappas, Phu Mon Htut, Neha Anna John, Yassine Benajiba, Dan Roth

    Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across various tasks. However, they are prone to contextual hallucination, generating information that is either unsubstantiated or contradictory to the given context. Although many studies have investigated contextual hallucinations in LLMs, addressing them in long-context inputs remains an open problem. In this work, we take a…

    Submitted 27 April, 2025; originally announced April 2025.

  2. arXiv:2503.21760  [pdf, ps, other]

    cs.CL

    MemInsight: Autonomous Memory Augmentation for LLM Agents

    Authors: Rana Salama, Jason Cai, Michelle Yuan, Anna Currey, Monica Sunkara, Yi Zhang, Yassine Benajiba

    Abstract: Large language model (LLM) agents have evolved to intelligently process information, make decisions, and interact with users or tools. A key capability is the integration of long-term memory capabilities, enabling these agents to draw upon historical interactions and knowledge. However, the growing memory size and need for semantic structuring pose significant challenges. In this work, we propose…

    Submitted 31 July, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

  3. arXiv:2502.12094  [pdf, other]

    cs.AI cs.CL

    A Study on Leveraging Search and Self-Feedback for Agent Reasoning

    Authors: Karthikeyan K, Michelle Yuan, Elman Mansimov, Katerina Margatina, Anurag Pratik, Daniele Bonadiman, Monica Sunkara, Yi Zhang, Yassine Benajiba

    Abstract: Recent works have demonstrated that incorporating search during inference can significantly improve reasoning capabilities of language agents. Some approaches may make use of the ground truth or rely on model's own generated feedback. The search algorithm uses this feedback to then produce values that will update its criterion for exploring and exploiting various reasoning paths. In this study, we…

    Submitted 17 February, 2025; originally announced February 2025.

    Comments: Under review

  4. arXiv:2502.01630  [pdf, ps, other]

    cs.AI

    TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues

    Authors: Yubin Ge, Salvatore Romeo, Jason Cai, Raphael Shu, Monica Sunkara, Yassine Benajiba, Yi Zhang

    Abstract: Temporal reasoning in multi-session dialogues presents a significant challenge which has been under-studied in previous temporal reasoning benchmarks. To bridge this gap, we propose a new evaluation task for temporal reasoning in multi-session dialogues and introduce an approach to construct a new benchmark by augmenting dialogues from LoCoMo and creating multi-choice QAs. Furthermore, we present…

    Submitted 24 September, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

    Comments: Accepted at ACL 2025 Findings

  5. arXiv:2502.00996  [pdf, other]

    cs.CL

    Self-supervised Analogical Learning using Language Models

    Authors: Ben Zhou, Sarthak Jain, Yi Zhang, Qiang Ning, Shuai Wang, Yassine Benajiba, Dan Roth

    Abstract: Large language models have been shown to suffer from reasoning inconsistency issues. That is, they fail more in situations unfamiliar to the training data, even though exact or very similar reasoning paths exist in more common cases that they can successfully solve. Such observations motivate us to propose methods that encourage models to understand the high-level and abstract reasoning processes…

    Submitted 2 February, 2025; originally announced February 2025.

  6. arXiv:2412.09572  [pdf, other]

    cs.CL

    DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction

    Authors: Yu Feng, Phu Mon Htut, Zheng Qi, Wei Xiao, Manuel Mager, Nikolaos Pappas, Kishaloy Halder, Yang Li, Yassine Benajiba, Dan Roth

    Abstract: Quantifying the uncertainty in the factual parametric knowledge of Large Language Models (LLMs), especially in a black-box setting, poses a significant challenge. Existing methods, which gauge a model's uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty. Models might respond consistently to the origin query with a wrong answer…

    Submitted 12 December, 2024; originally announced December 2024.

  7. arXiv:2410.19206  [pdf, other]

    cs.LG cs.CL

    Inference time LLM alignment in single and multidomain preference spectrum

    Authors: Sadat Shahriar, Zheng Qi, Nikolaos Pappas, Srikanth Doss, Monica Sunkara, Kishaloy Halder, Manuel Mager, Yassine Benajiba

    Abstract: Aligning Large Language Models (LLM) to address subjectivity and nuanced preference levels requires adequate flexibility and control, which can be a resource-intensive and time-consuming procedure. Existing training-time alignment methods require full re-training when a change is needed and inference-time ones typically require access to the reward model at each inference step. To address these li…

    Submitted 24 October, 2024; originally announced October 2024.

  8. arXiv:2410.12311  [pdf, other]

    cs.CL cs.AI

    Open Domain Question Answering with Conflicting Contexts

    Authors: Siyi Liu, Qiang Ning, Kishaloy Halder, Wei Xiao, Zheng Qi, Phu Mon Htut, Yi Zhang, Neha Anna John, Bonan Min, Yassine Benajiba, Dan Roth

    Abstract: Open domain question answering systems frequently rely on information retrieved from large collections of text (such as the Web) to answer questions. However, such collections of text often contain conflicting information, and indiscriminately depending on this information may result in untruthful and inaccurate answers. To understand the gravity of this problem, we collect a human-annotated datas…

    Submitted 27 April, 2025; v1 submitted 16 October, 2024; originally announced October 2024.

  9. arXiv:2410.09047  [pdf, other]

    cs.CL cs.AI cs.LG

    Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models

    Authors: Qin Liu, Chao Shang, Ling Liu, Nikolaos Pappas, Jie Ma, Neha Anna John, Srikanth Doss, Lluis Marquez, Miguel Ballesteros, Yassine Benajiba

    Abstract: The safety alignment ability of Vision-Language Models (VLMs) is prone to be degraded by the integration of the vision module compared to its LLM backbone. We investigate this phenomenon, dubbed as "safety alignment degradation" in this paper, and show that the challenge arises from the representation gap that emerges when introducing vision modality to VLMs. In particular, we show that the repr…

    Submitted 11 October, 2024; originally announced October 2024.

    Comments: Preprint

  10. arXiv:2410.05952  [pdf, other]

    cs.LG

    Active Evaluation Acquisition for Efficient LLM Benchmarking

    Authors: Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, Graham Horwood

    Abstract: As large language models (LLMs) become increasingly versatile, numerous large scale benchmarks have been developed to thoroughly assess their capabilities. These benchmarks typically consist of diverse datasets and prompts to evaluate different aspects of LLM performance. However, comprehensive evaluations on hundreds or thousands of prompts incur tremendous costs in terms of computation, money, a…

    Submitted 8 October, 2024; originally announced October 2024.

  11. arXiv:2405.00204  [pdf, other]

    cs.CL cs.AI

    General Purpose Verification for Chain of Thought Prompting

    Authors: Robert Vacareanu, Anurag Pratik, Evangelia Spiliopoulou, Zheng Qi, Giovanni Paolini, Neha Anna John, Jie Ma, Yassine Benajiba, Miguel Ballesteros

    Abstract: Many of the recent capabilities demonstrated by Large Language Models (LLMs) arise primarily from their ability to exploit contextual information. In this paper, we explore ways to improve reasoning capabilities of LLMs through (1) exploration of different chains of thought and (2) validation of the individual steps of the reasoning process. We propose three general principles that a model should…

    Submitted 30 April, 2024; originally announced May 2024.

    Comments: 22 pages, preprint

  12. arXiv:2403.06326  [pdf, other]

    cs.CL cs.AI cs.LG

    From Instructions to Constraints: Language Model Alignment with Automatic Constraint Verification

    Authors: Fei Wang, Chao Shang, Sarthak Jain, Shuai Wang, Qiang Ning, Bonan Min, Vittorio Castelli, Yassine Benajiba, Dan Roth

    Abstract: User alignment is crucial for adapting general-purpose language models (LMs) to downstream tasks, but human annotations are often not available for all types of instructions, especially those with customized constraints. We observe that user instructions typically contain constraints. While assessing response quality in terms of the whole instruction is often costly, efficiently evaluating the sat…

    Submitted 10 March, 2024; originally announced March 2024.

  13. arXiv:2402.18479  [pdf, other]

    cs.CL

    NewsQs: Multi-Source Question Generation for the Inquiring Mind

    Authors: Alyssa Hwang, Kalpit Dixit, Miguel Ballesteros, Yassine Benajiba, Vittorio Castelli, Markus Dreyer, Mohit Bansal, Kathleen McKeown

    Abstract: We present NewsQs (news-cues), a dataset that provides question-answer pairs for multiple news documents. To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles from the News On the Web corpus. We show that fine-tuning a model with control codes produces questions that are judg…

    Submitted 15 June, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: minor wording change

  14. arXiv:2305.17127  [pdf, other]

    cs.CL

    Characterizing and Measuring Linguistic Dataset Drift

    Authors: Tyler A. Chang, Kishaloy Halder, Neha Anna John, Yogarshi Vyas, Yassine Benajiba, Miguel Ballesteros, Dan Roth

    Abstract: NLP models often degrade in performance when real world data distributions differ markedly from training data. However, existing dataset drift metrics in NLP have generally not considered specific dimensions of linguistic drift that affect model performance, and they have not been validated in their ability to predict model performance at the individual example level, where such metrics are often…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023

  15. Diable: Efficient Dialogue State Tracking as Operations on Tables

    Authors: Pietro Lesci, Yoshinari Fujinuma, Momchil Hardalov, Chao Shang, Yassine Benajiba, Lluis Marquez

    Abstract: Sequence-to-sequence state-of-the-art systems for dialogue state tracking (DST) use the full dialogue history as input, represent the current state as a list with all the slots, and generate the entire state from scratch at each dialogue turn. This approach is inefficient, especially when the number of slots is large and the conversation is long. We propose Diable, a new task formalisation that si…

    Submitted 1 November, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023 (Findings)

    Journal ref: Findings of the Association for Computational Linguistics: ACL 2023

  16. arXiv:2305.13191  [pdf, other]

    cs.CL cs.AI cs.LG

    Taxonomy Expansion for Named Entity Recognition

    Authors: Karthikeyan K, Yogarshi Vyas, Jie Ma, Giovanni Paolini, Neha Anna John, Shuai Wang, Yassine Benajiba, Vittorio Castelli, Dan Roth, Miguel Ballesteros

    Abstract: Training a Named Entity Recognition (NER) model often involves fixing a taxonomy of entity types. However, requirements evolve and we might need the NER model to recognize additional entity types. A simple approach is to re-annotate entire dataset with both existing and additional entity types and then train the model on the re-annotated dataset. However, this is an extremely laborious task. To re…

    Submitted 22 May, 2023; originally announced May 2023.

  17. arXiv:2304.12982  [pdf, other]

    cs.CL

    Intent Induction from Conversations for Task-Oriented Dialogue Track at DSTC 11

    Authors: James Gung, Raphael Shu, Emily Moeng, Wesley Rose, Salvatore Romeo, Yassine Benajiba, Arshit Gupta, Saab Mansour, Yi Zhang

    Abstract: With increasing demand for and adoption of virtual assistants, recent work has investigated ways to accelerate bot schema design through the automatic induction of intents or the induction of slots and dialogue states. However, a lack of dedicated benchmarks and standardized evaluation has made progress difficult to track and comparisons between systems difficult to make. This challenge track, hel…

    Submitted 25 April, 2023; originally announced April 2023.

    Comments: 18 pages, 1 figure. Accepted at the DSTC 11 Workshop to be located at SIGDIAL 2023

  18. arXiv:2303.11660  [pdf, other]

    cs.CL

    Simple Yet Effective Synthetic Dataset Construction for Unsupervised Opinion Summarization

    Authors: Ming Shen, Jie Ma, Shuai Wang, Yogarshi Vyas, Kalpit Dixit, Miguel Ballesteros, Yassine Benajiba

    Abstract: Opinion summarization provides an important solution for summarizing opinions expressed among a large number of reviews. However, generating aspect-specific and general summaries is challenging due to the lack of annotated data. In this work, we propose two simple yet effective unsupervised approaches to generate both aspect-specific and general opinion summaries by training on synthetic datasets…

    Submitted 21 March, 2023; originally announced March 2023.

    Comments: EACL 2023 Findings

  19. arXiv:2302.12297  [pdf, other]

    cs.CL

    Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views

    Authors: Katerina Margatina, Shuai Wang, Yogarshi Vyas, Neha Anna John, Yassine Benajiba, Miguel Ballesteros

    Abstract: Temporal concept drift refers to the problem of data changing over time. In NLP, that would entail that language (e.g. new expressions, meaning shifts) and factual knowledge (e.g. new concepts, updated facts) evolve over time. Focusing on the latter, we benchmark 11 pretrained masked language models (MLMs) on a series of tests designed to evaluate the effect of temporal concept drift, as it is c…

    Submitted 23 February, 2023; originally announced February 2023.

    Comments: To appear at EACL 2023. Our code will be available at https://github.com/amazon-science/temporal-robustness

  20. arXiv:2210.06629  [pdf, other]

    cs.CL

    Instruction Tuning for Few-Shot Aspect-Based Sentiment Analysis

    Authors: Siddharth Varia, Shuai Wang, Kishaloy Halder, Robert Vacareanu, Miguel Ballesteros, Yassine Benajiba, Neha Anna John, Rishita Anubhai, Smaranda Muresan, Dan Roth

    Abstract: Aspect-based Sentiment Analysis (ABSA) is a fine-grained sentiment analysis task which involves four elements from user-generated texts: aspect term, aspect category, opinion term, and sentiment polarity. Most computational approaches focus on some of the ABSA sub-tasks such as tuple (aspect term, sentiment polarity) or triplet (aspect term, opinion term, sentiment polarity) extraction using eithe…

    Submitted 11 June, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: Camera ready copy for WASSA at ACL 2023

  21. arXiv:1812.06604  [pdf, other]

    cs.CL

    Siamese Networks for Semantic Pattern Similarity

    Authors: Yassine Benajiba, Jin Sun, Yong Zhang, Longquan Jiang, Zhiliang Weng, Or Biran

    Abstract: Semantic Pattern Similarity is an interesting, though not often encountered NLP task where two sentences are compared not by their specific meaning, but by their more abstract semantic pattern (e.g., preposition or frame). We utilize Siamese Networks to model this task, and show its usefulness in determining SQL patterns for unseen questions in a database-backed question answering scenario. Our ap…

    Submitted 16 December, 2018; originally announced December 2018.