
Showing 1–12 of 12 results for author: Boyd-Graber, J L

  1. arXiv:2506.15068  [pdf, ps, other]

    cs.CL cs.LG

    Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

    Authors: Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan Lee Boyd-Graber

    Abstract: Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-f…

    Submitted 17 June, 2025; originally announced June 2025.

  2. arXiv:2505.01481  [pdf, ps, other]

    cs.CV cs.LG

    VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

    Authors: Zongxia Li, Xiyang Wu, Guangyao Shi, Yubin Qin, Hongyang Du, Tianyi Zhou, Dinesh Manocha, Jordan Lee Boyd-Graber

    Abstract: Synthetic video generation has gained significant attention for its realism and broad applications, but remains prone to violations of common sense and physical laws. This highlights the need for reliable abnormality detectors that understand such principles and are robust to hallucinations. To address this, we introduce VideoHallu, a benchmark of over 3,000 video QA pairs built from synthetic vid…

    Submitted 18 June, 2025; v1 submitted 2 May, 2025; originally announced May 2025.

  3. arXiv:2503.06778  [pdf, other]

    cs.CL cs.AI

    Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators

    Authors: Feng Gu, Zongxia Li, Carlos Rafael Colon, Benjamin Evans, Ishani Mondal, Jordan Lee Boyd-Graber

    Abstract: Event annotation is important for identifying market changes, monitoring breaking news, and understanding sociological trends. Although expert annotators set the gold standards, human coding is expensive and inefficient. Unlike information extraction experiments that focus on single contexts, we evaluate a holistic workflow that removes irrelevant documents, merges documents about the same event,…

    Submitted 5 April, 2025; v1 submitted 9 March, 2025; originally announced March 2025.

    Comments: 9 pages, 4 figures

  4. arXiv:2502.19684  [pdf, other]

    cs.CL

    GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

    Authors: Yoo Yeon Sung, Eve Fleisig, Yu Hou, Ishan Upadhyay, Jordan Lee Boyd-Graber

    Abstract: Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possibl…

    Submitted 26 February, 2025; originally announced February 2025.

  5. arXiv:2502.14127  [pdf, ps, other]

    cs.CL

    Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

    Authors: Nishant Balepur, Rachel Rudinger, Jordan Lee Boyd-Graber

    Abstract: Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, where LLMs construct and explain answers…

    Submitted 31 May, 2025; v1 submitted 19 February, 2025; originally announced February 2025.

    Comments: ACL 2025

  6. arXiv:2502.12436  [pdf, ps, other]

    cs.CL

    Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL

    Authors: Wichayaporn Wongkamjan, Yanze Wang, Feng Gu, Denis Peskoff, Jonathan K. Kummerfeld, Jonathan May, Jordan Lee Boyd-Graber

    Abstract: An increasingly common socio-technical problem is people being taken in by offers that sound "too good to be true", where persuasion and trust shape decision-making. This paper investigates how AI can help detect these deceptive scenarios. We analyze how humans strategically deceive each other in Diplomacy, a board game that requires both natural language communication and strateg…

    Submitted 5 June, 2025; v1 submitted 17 February, 2025; originally announced February 2025.

    Comments: ACL Findings 2025

  7. arXiv:2501.11549  [pdf, ps, other]

    cs.CL

    Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas

    Authors: Nishant Balepur, Vishakh Padmakumar, Fumeng Yang, Shi Feng, Rachel Rudinger, Jordan Lee Boyd-Graber

    Abstract: LLMs are aligned to follow input instructions by learning which of two responses users prefer for a prompt. However, such preference data do not convey why users prefer responses that are chosen or rejected, so LLMs trained on these datasets cannot tailor responses to varied user needs. To surface these parameters of personalization, we apply abductive reasoning to preference data, inferring needs…

    Submitted 31 May, 2025; v1 submitted 20 January, 2025; originally announced January 2025.

    Comments: ACL 2025

  8. arXiv:2406.16342  [pdf, other]

    cs.CL

    Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness

    Authors: Yoo Yeon Sung, Maharshi Gor, Eve Fleisig, Ishani Mondal, Jordan Lee Boyd-Graber

    Abstract: Adversarial datasets should validate AI robustness by providing samples on which humans perform well, but models do not. However, as models evolve, datasets can become obsolete. Measuring whether a dataset remains adversarial is hindered by the lack of a standardized metric for measuring adversarialness. We propose AdvScore, a human-grounded evaluation metric that assesses a dataset's adversarialn…

    Submitted 18 February, 2025; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: arXiv admin note: text overlap with arXiv:2401.11185

  9. arXiv:2406.10900  [pdf, other]

    cs.CV cs.CL

    AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

    Authors: Xiyang Wu, Tianrui Guan, Dianqi Li, Shuaiyi Huang, Xiaoyu Liu, Xijun Wang, Ruiqi Xian, Abhinav Shrivastava, Furong Huang, Jordan Lee Boyd-Graber, Tianyi Zhou, Dinesh Manocha

    Abstract: Large vision-language models (LVLMs) are prone to hallucinations, where certain contextual cues in an image can trigger the language module to produce overconfident and incorrect reasoning about abnormal or hypothetical objects. While some benchmarks have been developed to investigate LVLM hallucinations, they often rely on hand-crafted corner cases whose failure patterns may not generalize well.…

    Submitted 8 October, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

  10. More Victories, Less Cooperation: Assessing Cicero's Diplomacy Play

    Authors: Wichayaporn Wongkamjan, Feng Gu, Yanze Wang, Ulf Hermjakob, Jonathan May, Brandon M. Stewart, Jonathan K. Kummerfeld, Denis Peskoff, Jordan Lee Boyd-Graber

    Abstract: The boardgame Diplomacy is a challenging setting for communicative and cooperative artificial intelligence. The most prominent communicative Diplomacy AI, Cicero, has excellent strategic abilities, exceeding human players. However, the best Diplomacy players master communication, not just tactics, which is why the game has received attention as an AI challenge. This work seeks to understand the de…

    Submitted 7 June, 2024; originally announced June 2024.

  11. arXiv:2402.11161  [pdf, other]

    cs.CL cs.AI

    PEDANTS: Cheap but Effective and Interpretable Answer Equivalence

    Authors: Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber

    Abstract: Question answering (QA) can only make progress if we know if an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs). There are two challenges with current short-form QA evaluations: a lack of diverse styles of evaluation data and an over-reliance on expensive and slow LLMs. LLM-based scorers correlate better with…

    Submitted 11 October, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: Efficient PEDANTS Classifier for short-form QA in github: https://github.com/zli12321/qa_metrics. arXiv admin note: text overlap with arXiv:2401.13170

    Journal ref: Empirical Methods in Natural Language Processing 2024

  12. arXiv:2312.01308  [pdf, other]

    cs.CL

    Bridging Background Knowledge Gaps in Translation with Automatic Explicitation

    Authors: HyoJung Han, Jordan Lee Boyd-Graber, Marine Carpuat

    Abstract: Translations help people understand content written in another language. However, even correct literal translations do not fulfill that goal when people lack the necessary background to understand them. Professional translators incorporate explicitations to explain the missing context by considering cultural differences between source and target audiences. Despite its potential to help users, NLP…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: EMNLP 2023