Showing 1–43 of 43 results for author: Stengel-Eskin, E

  1. arXiv:2504.15485  [pdf, other]

    cs.CV cs.AI cs.CL

    CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

    Authors: Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

    Abstract: Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a…

    Submitted 21 April, 2025; originally announced April 2025.

    Comments: Code and data: https://github.com/atinpothiraj/CAPTURe

  2. arXiv:2504.13079  [pdf, other]

    cs.CL cs.AI

    Retrieval-Augmented Generation with Conflicting Evidence

    Authors: Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied…

    Submitted 17 April, 2025; originally announced April 2025.

    Comments: Our data and code is available at: https://github.com/HanNight/RAMDocs

  3. arXiv:2504.09763  [pdf, other]

    cs.CL cs.AI cs.LG

    Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

    Authors: Zaid Khan, Elias Stengel-Eskin, Archiki Prasad, Jaemin Cho, Mohit Bansal

    Abstract: Scientists often infer abstract procedures from specific instances of problems and use the abstractions to generate new, related instances. For example, programs encoding the formal rules and properties of a system have been useful in fields ranging from RL (procedural environments) to physics (simulation engines). These programs can be seen as functions which execute to different outputs based on…

    Submitted 13 April, 2025; originally announced April 2025.

    Comments: Project Page: https://zaidkhan.me/EFAGen/

  4. arXiv:2504.07389  [pdf, other]

    cs.LG cs.AI cs.CL

    Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression

    Authors: Hanqi Xiao, Yi-Lin Sung, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining, but can degrade its downstream performance especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantiz…

    Submitted 9 April, 2025; originally announced April 2025.

    Comments: 24 pages. Code: https://github.com/The-Inscrutable-X/TACQ

  5. arXiv:2503.15272  [pdf, other]

    cs.CL cs.AI

    MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration

    Authors: David Wan, Justin Chih-Yao Chen, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Multi-agent collaboration among models has shown promise in reasoning tasks but is underexplored in long-form generation tasks like summarization and question-answering. We extend multi-agent multi-model reasoning to generation, specifically to improving faithfulness through refinement, i.e., revising model-generated outputs to remove factual inconsistencies. We investigate how iterative collabora…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: NAACL 2025, 18 pages. Code: https://github.com/meetdavidwan/mammrefine

  6. arXiv:2503.05641  [pdf, other]

    cs.CL cs.AI cs.LG

    Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning

    Authors: Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal

    Abstract: Combining existing pre-trained expert LLMs is a promising avenue for scalably tackling large-scale and diverse tasks. However, selecting experts at the task level is often too coarse-grained, as heterogeneous tasks may require different expertise for each instance. To enable adaptive instance-level mixing of pre-trained LLM experts, we propose Symbolic-MoE, a symbolic, text-based, and gradient-fre…

    Submitted 11 March, 2025; v1 submitted 7 March, 2025; originally announced March 2025.

    Comments: The first three authors contributed equally. Project Page: https://symbolic-moe.github.io/

  7. arXiv:2502.15082  [pdf, other]

    cs.LG cs.AI cs.CL

    UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning

    Authors: Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal

    Abstract: User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or "forgetting" a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model's other abilities in…

    Submitted 20 February, 2025; originally announced February 2025.

    Comments: Code: https://github.com/Vaidehi99/UPCORE

  8. arXiv:2502.14296  [pdf, other]

    cs.CY

    On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

    Authors: Yue Huang, Chujie Gao, Siyuan Wu, Haoran Wang, Xiangqi Wang, Yujun Zhou, Yanbo Wang, Jiayi Ye, Jiawen Shi, Qihui Zhang, Yuan Li, Han Bao, Zhaoyi Liu, Tianrui Guan, Dongping Chen, Ruoxi Chen, Kehan Guo, Andy Zou, Bryan Hooi Kuen-Yew, Caiming Xiong, Elias Stengel-Eskin, Hongyang Zhang, Hongzhi Yin, Huan Zhang, Huaxiu Yao, et al. (41 additional authors not shown)

    Abstract: Generative Foundation Models (GenFMs) have emerged as transformative tools. However, their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions. First, we systematically review global AI governance laws and policies from governments and regulatory bodies, a…

    Submitted 11 May, 2025; v1 submitted 20 February, 2025; originally announced February 2025.

  9. arXiv:2502.12446  [pdf, other]

    cs.CL cs.AI cs.LG

    Multi-Attribute Steering of Language Models via Targeted Intervention

    Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducin…

    Submitted 17 February, 2025; originally announced February 2025.

    Comments: 15 pages, code link: https://github.com/duykhuongnguyen/MAT-Steer

  10. arXiv:2502.01619  [pdf, other]

    cs.SE cs.AI cs.CL cs.LG

    Learning to Generate Unit Tests for Automated Debugging

    Authors: Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal

    Abstract: Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs), motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given a faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we p…

    Submitted 26 February, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

    Comments: First two authors contributed equally. Dataset and Code: https://github.com/archiki/UTGenDebug

  11. arXiv:2410.14596  [pdf, other]

    cs.CL cs.AI

    Teaching Models to Balance Resisting and Accepting Persuasion

    Authors: Elias Stengel-Eskin, Peter Hase, Mohit Bansal

    Abstract: Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve the…

    Submitted 10 February, 2025; v1 submitted 18 October, 2024; originally announced October 2024.

    Comments: NAACL Camera-Ready. Code: https://github.com/esteng/persuasion_balanced_training

  12. arXiv:2410.06215  [pdf, other]

    cs.CL cs.AI cs.LG

    DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback

    Authors: Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

    Abstract: The process of creating training data to teach models is currently driven by humans, who manually analyze model weaknesses and plan how to create data that improves a student model. Approaches using LLMs as annotators reduce human effort, but still require humans to interpret feedback from evaluations and control the LLM to produce data the student needs. Automating this labor-intensive process by…

    Submitted 13 March, 2025; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: ICLR 2025 Spotlight; Project Page: https://DataEnvGym.github.io

  13. arXiv:2410.01735  [pdf, other]

    cs.CL cs.LG

    LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

    Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Reward Models (RMs) play a crucial role in aligning LLMs with human preferences, enhancing their performance by ranking outputs during inference or iterative training. However, the degree to which an RM generalizes to new tasks is often not known a priori (e.g. some RMs may excel at scoring creative writing vs. math reasoning). Therefore, using only one fixed RM while training LLMs can be suboptim…

    Submitted 2 October, 2024; originally announced October 2024.

    Comments: 20 pages; First two authors contributed equally. Code: https://github.com/duykhuongnguyen/LASeR-MAB

  14. arXiv:2409.12147  [pdf, other]

    cs.CL

    MAgICoRe: Multi-Agent, Iterative, Coarse-to-Fine Refinement for Reasoning

    Authors: Justin Chih-Yao Chen, Archiki Prasad, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Large Language Models' (LLM) reasoning can be improved using test-time aggregation strategies, i.e., generating multiple samples and voting among generated samples. While these improve performance, they often reach a saturation point. Refinement offers an alternative by using LLM-generated feedback to improve solution quality. However, refinement introduces 3 key challenges: (1) Excessive refineme…

    Submitted 18 September, 2024; originally announced September 2024.

    Comments: 22 pages, code: https://github.com/dinobby/MAgICoRe

  15. arXiv:2409.07394  [pdf, other]

    cs.CL

    AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge

    Authors: Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Knowledge conflict arises from discrepancies between information in the context of a large language model (LLM) and the knowledge stored in its parameters. This can hurt performance when using standard decoding techniques, which tend to ignore the context. Existing test-time contrastive methods seek to address this by comparing the LLM's output distribution with and without the context and adjust…

    Submitted 28 April, 2025; v1 submitted 11 September, 2024; originally announced September 2024.

    Comments: NAACL 2025 (main conference), Code: https://github.com/HanNight/AdaCAD

  16. arXiv:2407.14414  [pdf, other]

    cs.AI cs.CL cs.LG

    System-1.x: Learning to Balance Fast and Slow Planning with Language Models

    Authors: Swarnadeep Saha, Archiki Prasad, Justin Chih-Yao Chen, Peter Hase, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Language models can be used to solve long-horizon planning problems in two distinct modes: a fast 'System-1' mode, directly generating plans without any explicit search or backtracking, and a slow 'System-2' mode, planning step-by-step by explicitly searching over possible actions. While System-2 is typically more effective, it is also more computationally expensive, making it infeasible for long…

    Submitted 14 April, 2025; v1 submitted 19 July, 2024; originally announced July 2024.

    Comments: ICLR 2025 (Camera-Ready)

  17. arXiv:2406.19354  [pdf, other]

    cs.CL cs.AI

    Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?

    Authors: Peter Hase, Thomas Hofweber, Xiang Zhou, Elias Stengel-Eskin, Mohit Bansal

    Abstract: The model editing problem concerns how language models should learn new facts about the world over time. While empirical research on model editing has drawn widespread attention, the conceptual foundations of model editing remain shaky -- perhaps unsurprisingly, since model editing is essentially belief revision, a storied problem in philosophy that has eluded succinct solutions for decades. Model…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 23 pages, 4 figures

  18. arXiv:2406.11665  [pdf, other]

    cs.CL cs.AI cs.CV

    See It from My Perspective: How Language Affects Cultural Bias in Image Understanding

    Authors: Amith Ananthram, Elias Stengel-Eskin, Mohit Bansal, Kathleen McKeown

    Abstract: Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from East Asian cultures attend more to scene context. In this work, we characterize the Western bias of VLMs in image understanding and investi…

    Submitted 28 February, 2025; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Accepted at ICLR 2025. 22 pages, 6 figures. Code/models: https://github.com/amith-ananthram/see-it-from-my-perspective

  19. arXiv:2406.03442  [pdf, ps, other]

    cs.CL cs.AI

    Are language models rational? The case of coherence norms and belief revision

    Authors: Thomas Hofweber, Peter Hase, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Do norms of rationality apply to machine learning models, in particular language models? In this paper we investigate this question by focusing on a special subset of rational norms: coherence norms. We consider both logical coherence norms as well as coherence norms tied to the strength of belief. To make sense of the latter, we introduce the Minimal Assent Connection (MAC) and propose a new acco…

    Submitted 10 August, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

    Comments: added discussion and cross reference of new empirical work by the authors, updated references, fixed typos

  20. arXiv:2405.21028  [pdf, other]

    cs.CL cs.AI

    LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models

    Authors: Elias Stengel-Eskin, Peter Hase, Mohit Bansal

    Abstract: When answering questions, LLMs can convey not only an answer, but a level of confidence about the answer being correct. This includes explicit confidence markers (e.g. giving a numeric score) as well as implicit markers, like an authoritative tone or elaborating with additional knowledge. For LLMs to be trustworthy knowledge sources, the confidence they convey should match their actual expertise;…

    Submitted 3 July, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

    Comments: 18 pages. Code: https://github.com/esteng/pragmatic_calibration

  21. arXiv:2405.19209  [pdf, other]

    cs.CV cs.AI cs.CL

    VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos

    Authors: Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal

    Abstract: Long-form video understanding is complicated by the high redundancy of video data and the abundance of query-irrelevant information. To tackle these challenges, we propose VideoTree, a training-free framework which builds a query-adaptive and hierarchical video representation for LLM reasoning over long-form videos. First, VideoTree extracts query-relevant information from the input video through…

    Submitted 14 March, 2025; v1 submitted 29 May, 2024; originally announced May 2024.

    Comments: CVPR 2025; First three authors contributed equally; Project page: https://videotree2024.github.io/

  22. arXiv:2405.02749  [pdf, other]

    cs.LG

    Sub-goal Distillation: A Method to Improve Small Language Agents

    Authors: Maryam Hashemzadeh, Elias Stengel-Eskin, Sarath Chandar, Marc-Alexandre Cote

    Abstract: While Large Language Models (LLMs) have demonstrated significant promise as agents in interactive tasks, their substantial computational requirements and restricted number of calls constrain their practical utility, especially in long-horizon interactive tasks such as decision-making or in scenarios involving continuous ongoing tasks. To address these constraints, we propose a method for transferr…

    Submitted 4 May, 2024; originally announced May 2024.

  23. arXiv:2403.02325  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

    Authors: David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Highlighting particularly relevant regions of an image can improve the performance of vision-language models (VLMs) on various vision-language (VL) tasks by guiding the model to attend more closely to these regions of interest. For example, VLMs can be given a "visual prompt", where visual markers such as bounding boxes delineate key image regions. However, current VLMs that can incorporate visual…

    Submitted 4 March, 2024; originally announced March 2024.

    Comments: Project website: https://contrastive-region-guidance.github.io/

  24. arXiv:2402.16354  [pdf, other]

    cs.LG cs.AI cs.CL

    Language-guided Skill Learning with Temporal Variational Inference

    Authors: Haotian Fu, Pratyusha Sharma, Elias Stengel-Eskin, George Konidaris, Nicolas Le Roux, Marc-Alexandre Côté, Xingdi Yuan

    Abstract: We present an algorithm for skill discovery from expert demonstrations. The algorithm first utilizes Large Language Models (LLMs) to propose an initial segmentation of the trajectories. Following that, a hierarchical variational inference framework incorporates the LLM-generated segmentation information to discover reusable skills by merging trajectory segments. To further control the trade-off be…

    Submitted 27 May, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: ICML 2024

  25. arXiv:2402.13212  [pdf, other]

    cs.CL cs.AI cs.LG

    Soft Self-Consistency Improves Language Model Agents

    Authors: Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Generations from large language models (LLMs) can be improved by sampling and scoring multiple solutions to select a final answer. Current "sample and select" methods such as self-consistency (SC) rely on majority voting to score answers. However, when tasks have many distinct and valid answers, selection by voting requires a large number of samples. This makes SC prohibitively expensive for inter…

    Submitted 5 June, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

    Comments: ACL 2024 Camera-Ready, the first three authors contributed equally; Code: https://github.com/HanNight/soft_self_consistency

  26. arXiv:2402.12348  [pdf, other]

    cs.CL cs.AI cs.LG

    GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

    Authors: Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, Kaidi Xu

    Abstract: As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities are increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a langu…

    Submitted 10 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: 26 pages; the first two authors contributed equally; GTBench HF Leaderboard: https://huggingface.co/spaces/GTBench/GTBench

  27. arXiv:2402.01620  [pdf, other]

    cs.CL

    MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models

    Authors: Justin Chih-Yao Chen, Swarnadeep Saha, Elias Stengel-Eskin, Mohit Bansal

    Abstract: Multi-agent interactions between Large Language Model (LLM) agents have shown major improvements on diverse reasoning tasks. However, these involve long generations from multiple models across several rounds, making them expensive. Moreover, these multi-agent approaches fail to provide a final, single model for efficient inference. To address this, we introduce MAGDi, a new method for structured d…

    Submitted 7 June, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: ICML 2024 (Camera-ready); First two authors contributed equally; GitHub: https://github.com/dinobby/MAGDi

  28. arXiv:2401.16467  [pdf, other]

    cs.SE cs.AI cs.CL cs.LG cs.PL

    ReGAL: Refactoring Programs to Discover Generalizable Abstractions

    Authors: Elias Stengel-Eskin, Archiki Prasad, Mohit Bansal

    Abstract: While large language models (LLMs) are increasingly being used for program synthesis, they lack the global view needed to develop useful abstractions; they generally predict programs one at a time, often repeating the same functionality. Generating redundant code from scratch is both inefficient and error-prone. To address this, we propose Refactoring for Generalizable Abstraction Learning (ReGAL)…

    Submitted 6 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: ICML 2024 Camera-Ready; First two authors contributed equally; Code: https://github.com/esteng/regal_program_learning

  29. arXiv:2310.05861  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models

    Authors: Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

    Abstract: An increasing number of vision-language tasks can be handled with little to no training, i.e., in a zero and few-shot manner, by marrying large language models (LLMs) to vision encoders, resulting in large vision-language models (LVLMs). While this has huge upsides, such as not requiring training data or custom architectures, how an input is presented to an LVLM can have a major impact on zero-sho…

    Submitted 2 April, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: ICLR 2024 camera-ready (23 pages), Code: https://github.com/archiki/RepARe

  30. arXiv:2306.00824  [pdf, other]

    cs.CL

    Zero and Few-shot Semantic Parsing with Ambiguous Inputs

    Authors: Elias Stengel-Eskin, Kyle Rawlins, Benjamin Van Durme

    Abstract: Despite the frequent challenges posed by ambiguity when representing meaning via natural language, it is often ignored or deliberately removed in tasks mapping language to formally-designed representations, which generally assume a one-to-one mapping between linguistic and formal representations. We attempt to address this shortcoming by introducing AmP, a framework, dataset, and challenge for tra…

    Submitted 22 January, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

    Comments: ICLR 2024 Camera Ready

  31. arXiv:2303.16857  [pdf, other]

    cs.CL

    Did You Mean...? Confidence-based Trade-offs in Semantic Parsing

    Authors: Elias Stengel-Eskin, Benjamin Van Durme

    Abstract: We illustrate how a calibrated model can help balance common trade-offs in task-oriented parsing. In a simulated annotator-in-the-loop experiment, we show that well-calibrated confidence scores allow us to balance cost with annotator load, improving accuracy with a small number of interactions. We then examine how confidence scores can help optimize the trade-off between usability and safety. We s…

    Submitted 20 October, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: EMNLP 2023, Camera ready. arXiv admin note: substantial text overlap with arXiv:2211.07443

  32. arXiv:2212.00259  [pdf, other]

    cs.CV cs.CL

    Super-CLEVR: A Virtual Benchmark to Diagnose Domain Robustness in Visual Reasoning

    Authors: Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, Alan Yuille

    Abstract: Visual Question Answering (VQA) models often perform poorly on out-of-distribution data and struggle on domain generalization. Due to the multi-modal nature of this task, multiple factors of variation are intertwined, making generalization difficult to analyze. This motivates us to introduce a virtual benchmark, Super-CLEVR, where different factors in VQA domain shifts can be isolated in order tha…

    Submitted 31 May, 2023; v1 submitted 30 November, 2022; originally announced December 2022.

    Comments: Published in CVPR 2023 as Highlight. Data and code are released at https://github.com/Lizw14/Super-CLEVR

  33. arXiv:2211.07516  [pdf, other]

    cs.CL

    Why Did the Chicken Cross the Road? Rephrasing and Analyzing Ambiguous Questions in VQA

    Authors: Elias Stengel-Eskin, Jimena Guallar-Blasco, Yi Zhou, Benjamin Van Durme

    Abstract: Natural language is ambiguous. Resolving ambiguous questions is key to successfully answering them. Focusing on questions about images, we create a dataset of ambiguous examples. We annotate these, grouping answers by the underlying question they address and rephrasing the question for each group to reduce ambiguity. Our analysis reveals a linguistically-aligned ontology of reasons for ambiguity i…

    Submitted 1 June, 2023; v1 submitted 14 November, 2022; originally announced November 2022.

    Comments: ACL 2023. Code and data: https://github.com/esteng/ambiguous_vqa

  34. arXiv:2211.07443  [pdf, other]

    cs.CL

    Calibrated Interpretation: Confidence Estimation in Semantic Parsing

    Authors: Elias Stengel-Eskin, Benjamin Van Durme

    Abstract: Sequence generation models are increasingly being used to translate natural language into programs, i.e. to perform executable semantic parsing. The fact that semantic parsing aims to predict programs that can lead to executed actions in the real world motivates developing safe systems. This in turn makes measuring calibration -- a central component to safety -- particularly important. We investig…

    Submitted 6 July, 2023; v1 submitted 14 November, 2022; originally announced November 2022.

    Comments: TACL Camera-ready

  35. arXiv:2205.12228  [pdf, other]

    cs.CL

    When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems

    Authors: Elias Stengel-Eskin, Emmanouil Antonios Platanios, Adam Pauls, Sam Thomson, Hao Fang, Benjamin Van Durme, Jason Eisner, Yu Su

    Abstract: In natural language understanding (NLU) production systems, users' evolving needs necessitate the addition of new features over time, indexed by new symbols added to the meaning representation space. This requires additional training data and results in ever-growing datasets. We present the first systematic investigation of this incremental symbol learning scenario. Our analysis reveals a troublin…

    Submitted 8 November, 2022; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: EMNLP 2022

  36. arXiv:2205.12113  [pdf, other]

    cs.CL

    The Curious Case of Control

    Authors: Elias Stengel-Eskin, Benjamin Van Durme

    Abstract: Children acquiring English make systematic errors on subject control sentences even after they have reached near-adult competence (C. Chomsky, 1969), possibly due to heuristics based on semantic roles (Maratsos, 1974). Given the advanced fluency of large generative language models, we ask whether model outputs are consistent with these heuristics, and to what degree different models are consistent…

    Submitted 8 November, 2022; v1 submitted 24 May, 2022; originally announced May 2022.

    Comments: EMNLP 2022

  37. arXiv:2205.01850  [pdf, other]

    cs.CL cs.CV

    Visual Commonsense in Pretrained Unimodal and Multimodal Models

    Authors: Chenyu Zhang, Benjamin Van Durme, Zhuowan Li, Elias Stengel-Eskin

    Abstract: Our commonsense knowledge about objects includes their typical visual attributes; we know that bananas are typically yellow or green, and not purple. Text and image corpora, being subject to reporting bias, represent this world-knowledge to varying degrees of faithfulness. In this paper, we investigate to what degree unimodal (language-only) and multimodal (image and language) models capture a bro…

    Submitted 3 May, 2022; originally announced May 2022.

    Comments: To appear in NAACL 2022

  38. arXiv:2110.00519  [pdf, other]

    cs.CV cs.CL

    Calibrating Concepts and Operations: Towards Symbolic Reasoning on Real Images

    Authors: Zhuowan Li, Elias Stengel-Eskin, Yixiao Zhang, Cihang Xie, Quan Tran, Benjamin Van Durme, Alan Yuille

    Abstract: While neural symbolic methods demonstrate impressive performance in visual question answering on synthetic images, their performance suffers on real images. We identify that the long-tail distribution of visual concepts and unequal importance of reasoning steps in real data are the two key obstacles that limit the models' real-world potentials. To address these challenges, we propose a new paradig…

    Submitted 1 October, 2021; originally announced October 2021.

    Comments: To appear in ICCV2021; Code at https://github.com/Lizw14/CaliCO.git

  39. arXiv:2104.05696  [pdf, other]

    cs.CL

    Joint Universal Syntactic and Semantic Parsing

    Authors: Elias Stengel-Eskin, Kenton Murray, Sheng Zhang, Aaron Steven White, Benjamin Van Durme

    Abstract: While numerous attempts have been made to jointly parse syntax and semantics, high performance in one domain typically comes at the price of performance in the other. This trade-off contradicts the large body of research focusing on the rich interactions at the syntax-semantics interface. We explore multiple model architectures which allow us to exploit the rich syntactic and semantic annotations…

    Submitted 12 April, 2021; originally announced April 2021.

    Comments: To appear: TACL 2021

  40. arXiv:2007.00320  [pdf, other]

    cs.CL

    Iterative Paraphrastic Augmentation with Discriminative Span Alignment

    Authors: Ryan Culkin, J. Edward Hu, Elias Stengel-Eskin, Guanghui Qin, Benjamin Van Durme

    Abstract: We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing resources, or the rapid creation of new resources from a small, manually-produced seed corpus. We illustrate our framework on the Berkeley FrameNet Project, a large-scale language understa…

    Submitted 1 July, 2020; originally announced July 2020.

  41. arXiv:1910.10138  [pdf, other]

    cs.CL

    Universal Decompositional Semantic Parsing

    Authors: Elias Stengel-Eskin, Aaron Steven White, Sheng Zhang, Benjamin Van Durme

    Abstract: We introduce a transductive model for parsing into Universal Decompositional Semantics (UDS) representations, which jointly learns to map natural language utterances into UDS graph structures and annotate the graph with decompositional semantic attribute scores. We also introduce a strong pipeline model for parsing into the UDS graph structure, and show that our transductive parser performs compar…

    Submitted 2 May, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: ACL 2020

  42. arXiv:1909.13851  [pdf, other]

    cs.CL

    The Universal Decompositional Semantics Dataset and Decomp Toolkit

    Authors: Aaron Steven White, Elias Stengel-Eskin, Siddharth Vashishtha, Venkata Govindarajan, Dee Ann Reisinger, Tim Vieira, Keisuke Sakaguchi, Sheng Zhang, Francis Ferraro, Rachel Rudinger, Kyle Rawlins, Benjamin Van Durme

    Abstract: We present the Universal Decompositional Semantics (UDS) dataset (v1.0), which is bundled with the Decomp toolkit (v0.1). UDS1.0 unifies five high-quality, decompositional semantics-aligned annotation sets within a single semantic graph specification---with graph structures defined by the predicative patterns produced by the PredPatt tool and real-valued node and edge attributes constructed using…

    Submitted 30 September, 2019; originally announced September 2019.

  43. arXiv:1909.00444  [pdf, other]

    cs.CL

    A Discriminative Neural Model for Cross-Lingual Word Alignment

    Authors: Elias Stengel-Eskin, Tzu-Ray Su, Matt Post, Benjamin Van Durme

    Abstract: We introduce a novel discriminative word alignment model, which we integrate into a Transformer-based machine translation model. In experiments based on a small number of labeled examples (~1.7K-5K sentences) we evaluate its performance intrinsically on both English-Chinese and English-Arabic alignment, where we achieve major improvements over unsupervised baselines (11-27 F1). We evaluate the mod…

    Submitted 1 September, 2019; originally announced September 2019.

    Comments: EMNLP 2019