Showing 1–7 of 7 results for author: Vasselli, J

  1. arXiv:2506.01535

    cs.CL cs.AI

    Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries

    Authors: Haruki Sakajo, Yusuke Ide, Justin Vasselli, Yusuke Sakai, Yingtao Tian, Hidetaka Kamigaito, Taro Watanabe

    Abstract: Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which a…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted to ACL 2025 Findings

  2. arXiv:2501.06728

    cs.CL

    Measuring the Robustness of Reference-Free Dialogue Evaluation Systems

    Authors: Justin Vasselli, Adam Nohejl, Taro Watanabe

    Abstract: Advancements in dialogue systems powered by large language models (LLMs) have outpaced the development of reliable evaluation metrics, particularly for diverse and creative responses. We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks: speaker tag prefixes, static responses, ungrammatical responses, and repeated co…

    Submitted 12 January, 2025; originally announced January 2025.

  3. arXiv:2412.18151

    cs.CL

    CoAM: Corpus of All-Type Multiword Expressions

    Authors: Yusuke Ide, Joshua Tanner, Adam Nohejl, Jacob Hoffman, Justin Vasselli, Hidetaka Kamigaito, Taro Watanabe

    Abstract: Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type M…

    Submitted 31 May, 2025; v1 submitted 23 December, 2024; originally announced December 2024.

    Comments: ACL 2025 main

  4. arXiv:2412.13110

    cs.CL

    Improving Explainability of Sentence-level Metrics via Edit-level Attribution for Grammatical Error Correction

    Authors: Takumi Goto, Justin Vasselli, Taro Watanabe

    Abstract: Various evaluation metrics have been proposed for Grammatical Error Correction (GEC), but many, particularly reference-free metrics, lack explainability. This lack of explainability hinders researchers from analyzing the strengths and weaknesses of GEC models and limits the ability to provide detailed feedback for users. To address this issue, we propose attributing sentence-level scores to indivi…

    Submitted 17 December, 2024; originally announced December 2024.

  5. arXiv:2410.03240

    cs.CL

    Beyond Film Subtitles: Is YouTube the Best Approximation of Spoken Vocabulary?

    Authors: Adam Nohejl, Frederikus Hudi, Eunike Andriani Kardinata, Shintaro Ozaki, Maria Angelica Riera Machin, Hongyu Sun, Justin Vasselli, Taro Watanabe

    Abstract: Word frequency is a key variable in psycholinguistics, useful for modeling human familiarity with words even in the era of large language models (LLMs). Frequency in film subtitles has proved to be a particularly good approximation of everyday language exposure. For many languages, however, film subtitles are not easily available, or are overwhelmingly translated from English. We demonstrate that…

    Submitted 11 January, 2025; v1 submitted 4 October, 2024; originally announced October 2024.

    Comments: Accepted to COLING 2025. 9 pages, 3 figures

  6. arXiv:2408.09639

    cs.CL cs.AI

    How to Make the Most of LLMs' Grammatical Knowledge for Acceptability Judgments

    Authors: Yusuke Ide, Yuto Nishida, Justin Vasselli, Miyu Oba, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

    Abstract: The grammatical knowledge of language models (LMs) is often measured using a benchmark of linguistic minimal pairs, where the LMs are presented with a pair of acceptable and unacceptable sentences and required to judge which is more acceptable. Conventional approaches directly compare sentence probabilities assigned by LMs, but recent large language models (LLMs) are trained to perform tasks via p…

    Submitted 7 February, 2025; v1 submitted 18 August, 2024; originally announced August 2024.

    Comments: NAACL 2025 main

  7. arXiv:2310.12352

    cs.CL

    knn-seq: Efficient, Extensible kNN-MT Framework

    Authors: Hiroyuki Deguchi, Hayate Hirano, Tomoki Hoshino, Yuto Nishida, Justin Vasselli, Taro Watanabe

    Abstract: k-nearest-neighbor machine translation (kNN-MT) boosts the translation quality of a pre-trained neural machine translation (NMT) model by utilizing translation examples during decoding. Translation examples are stored in a vector database, called a datastore, which contains one entry for each target token from the parallel data it is made from. Due to its size, it is computationally expensive both…

    Submitted 18 October, 2023; originally announced October 2023.