Showing 1–30 of 30 results for author: Sugawara, S

  1. arXiv:2506.03501

    cs.CL cs.AI

    Measuring Human Involvement in AI-Generated Text: A Case Study on Academic Writing

    Authors: Yuchen Guo, Zhicheng Dou, Huy H. Nguyen, Ching-Chun Chang, Saku Sugawara, Isao Echizen

    Abstract: Content creation has dramatically progressed with the rapid advancement of large language models like ChatGPT and Claude. While this progress has greatly enhanced various aspects of life and work, it has also negatively affected certain areas of society. A recent survey revealed that nearly 30% of college students use generative AI to help write academic papers and reports. Most countermeasures tr…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: IJCNN2025 accepted

  2. Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs

    Authors: Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui

    Abstract: Short-reading comprehension questions help students understand text structure but lack effective feedback. Students struggle to identify and correct errors, while manual feedback creation is labor-intensive. This highlights the need for automated feedback linking responses to a scoring rubric for deeper comprehension. Despite advances in Natural Language Processing (NLP), research has focused on…

    Submitted 26 January, 2025; originally announced January 2025.

    Comments: 16th International Conference on Education and New Learning Technologies

  3. arXiv:2410.06022

    cs.CL

    Can Language Models Induce Grammatical Knowledge from Indirect Evidence?

    Authors: Miyu Oba, Yohei Oseki, Akiyo Fukatsu, Akari Haga, Hiroki Ouchi, Taro Watanabe, Saku Sugawara

    Abstract: What kinds of and how much data is necessary for language models to induce grammatical knowledge to judge sentence acceptability? Recent language models still have much room for improvement in their data efficiency compared to humans. This paper investigates whether language models efficiently use indirect data (indirect evidence), from which they infer sentence acceptability. In contrast, humans…

    Submitted 23 October, 2024; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: This paper is accepted at EMNLP 2024 Main

  4. arXiv:2410.04838

    cs.CL

    Rationale-Aware Answer Verification by Pairwise Self-Evaluation

    Authors: Akira Kawabata, Saku Sugawara

    Abstract: Answer verification identifies correct solutions among candidates generated by large language models (LLMs). Current approaches typically train verifier models by labeling solutions as correct or incorrect based solely on whether the final answer matches the gold answer. However, this approach neglects any flawed rationale in the solution yielding the correct answer, undermining the verifier's abi…

    Submitted 25 October, 2024; v1 submitted 7 October, 2024; originally announced October 2024.

    Comments: EMNLP 2024

  5. arXiv:2407.03963

    cs.CL cs.AI

    LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

    Authors: LLM-jp: Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto, Takuya Fukushima, Namgi Han, Yuto Harada, Chikara Hashimoto, Tatsuya Hiraoka, Shohei Hisada, Sosuke Hosokawa, Lu Jie, Keisuke Kamata, Teruhito Kanazawa, Hiroki Kanezashi, Hiroshi Kataoka, Satoru Katsumata, Daisuke Kawahara, Seiya Kawano, et al. (58 additional authors not shown)

    Abstract: This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its…

    Submitted 30 December, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

  6. arXiv:2406.13397

    cs.CL

    MoreHopQA: More Than Multi-hop Reasoning

    Authors: Julian Schnitzler, Xanh Ho, Jiahao Huang, Florian Boudin, Saku Sugawara, Akiko Aizawa

    Abstract: Most existing multi-hop datasets are extractive answer datasets, where the answers to the questions can be extracted directly from the provided context. This often leads models to use heuristics or shortcuts instead of performing true multi-hop reasoning. In this paper, we propose a new multi-hop dataset, MoreHopQA, which shifts from extractive to generative answers. Our dataset is created by util…

    Submitted 19 June, 2024; originally announced June 2024.

    Comments: 8 pages, 5 figures. First three authors contributed equally

  7. arXiv:2406.03666

    cs.CL

    What Makes Language Models Good-enough?

    Authors: Daiki Asami, Saku Sugawara

    Abstract: Psycholinguistic research suggests that humans may build a representation of linguistic input that is 'good-enough' for the task at hand. This study examines what architectural features make language models learn human-like good-enough language processing. We focus on the number of layers and self-attention heads in Transformers. We create a good-enough language processing (GELP) evaluation datase…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: To appear in Findings of ACL2024

  8. PUAD: Frustratingly Simple Method for Robust Anomaly Detection

    Authors: Shota Sugawara, Ryuji Imamura

    Abstract: Developing an accurate and fast anomaly detection model is an important task in real-time computer vision applications. There has been much research to develop a single model that detects either structural or logical anomalies, which are inherently distinct. The majority of the existing approaches implicitly assume that the anomaly can be represented by identifying the anomalous location. However,…

    Submitted 23 February, 2024; originally announced February 2024.

    Comments: 8 pages, 4 figures

    Journal ref: IEEE International Conference on Image Processing (ICIP), pp. 842-848, 2024

  9. arXiv:2312.08755

    cs.CL

    PROPRES: Investigating the Projectivity of Presupposition with Various Triggers and Environments

    Authors: Daiki Asami, Saku Sugawara

    Abstract: What makes a presupposition of an utterance -- information taken for granted by its speaker -- different from other pragmatic inferences such as an entailment is projectivity (e.g., the negative sentence "the boy did not stop shedding tears" presupposes the boy had shed tears before). The projectivity may vary depending on the combination of presupposition triggers and environments. However, prior n…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Accepted by the 27th Conference on Computational Natural Language Learning (CoNLL2023)

  10. arXiv:2311.18353

    cs.CL

    Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension

    Authors: Akira Kawabata, Saku Sugawara

    Abstract: To precisely evaluate a language model's capability for logical reading comprehension, we present a dataset for testing the understanding of the rationale behind critical reasoning. For questions taken from an existing multiple-choice logical reading comprehension dataset, we crowdsource rationale texts that explain why we should select or eliminate answer options, resulting in 3,003 multiple-choic…

    Submitted 30 November, 2023; originally announced November 2023.

    Comments: Accepted to EMNLP 2023

  11. arXiv:2306.02258

    cs.CL

    Probing Physical Reasoning with Counter-Commonsense Context

    Authors: Kazushi Kondo, Saku Sugawara, Akiko Aizawa

    Abstract: In this study, we create a CConS (Counter-commonsense Contextual Size comparison) dataset to investigate how physical commonsense affects the contextualized size comparison task; the proposed dataset consists of both contexts that fit physical commonsense and those that do not. This dataset tests the ability of language models to predict the size relationship between objects under various contexts…

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: Accepted to ACL 2023 (Short Paper)

  12. arXiv:2305.15130

    cs.CL cs.AI

    On Degrees of Freedom in Defining and Testing Natural Language Understanding

    Authors: Saku Sugawara, Shun Tsugita

    Abstract: Natural language understanding (NLU) studies often exaggerate or underestimate the capabilities of systems, thereby limiting the reproducibility of their findings. These erroneous evaluations can be attributed to the difficulty of defining and testing NLU adequately. In this position paper, we reconsider this challenge by identifying two types of researcher degrees of freedom. We revisit Turing's…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted to Findings of ACL 2023

  13. arXiv:2302.05963

    cs.CL

    Analyzing the Effectiveness of the Underlying Reasoning Tasks in Multi-hop Question Answering

    Authors: Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, Akiko Aizawa

    Abstract: To explain the predicted answers and evaluate the reasoning abilities of models, several studies have utilized underlying reasoning (UR) tasks in multi-hop question answering (QA) datasets. However, it remains an open question as to how effective UR tasks are for the QA task when training models on both tasks in an end-to-end manner. In this study, we address this question by analyzing the effecti…

    Submitted 12 February, 2023; originally announced February 2023.

    Comments: Accepted by EACL 2023 (Findings)

  14. arXiv:2212.07075

    cs.CV cs.CL

    Cross-Modal Similarity-Based Curriculum Learning for Image Captioning

    Authors: Hongkuan Zhang, Saku Sugawara, Akiko Aizawa, Lei Zhou, Ryohei Sasano, Koichi Takeda

    Abstract: Image captioning models require the high-level generalization ability to describe the contents of various images in words. Most existing approaches treat the image-caption pairs equally in their training without considering the differences in their learning difficulties. Several image captioning approaches introduce curriculum learning methods that present training data with increasing levels of d…

    Submitted 14 December, 2022; originally announced December 2022.

    Comments: EMNLP 2022

  15. arXiv:2211.16220

    cs.CL

    Which Shortcut Solution Do Question Answering Models Prefer to Learn?

    Authors: Kazutoshi Shinoda, Saku Sugawara, Akiko Aizawa

    Abstract: Question answering (QA) models for reading comprehension tend to learn shortcut solutions rather than the solutions intended by QA datasets. QA models that have learned shortcut solutions can achieve human-level performance in shortcut examples where shortcuts are valid, but these same behaviors degrade generalization potential on anti-shortcut examples where shortcuts are invalid. Various methods…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: Accepted to AAAI 2023

  16. arXiv:2211.16093

    cs.CL

    Penalizing Confident Predictions on Largely Perturbed Inputs Does Not Improve Out-of-Distribution Generalization in Question Answering

    Authors: Kazutoshi Shinoda, Saku Sugawara, Akiko Aizawa

    Abstract: Question answering (QA) models are shown to be insensitive to large perturbations to inputs; that is, they make correct and confident predictions even when given largely perturbed inputs from which humans cannot correctly derive answers. In addition, QA models fail to generalize to other domains and adversarial test sets, while humans maintain high accuracy. Based on these observations, we assume…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: Accepted to the KnowledgeNLP workshop at AAAI 2023

  17. arXiv:2210.16079

    cs.CL

    Debiasing Masks: A New Framework for Shortcut Mitigation in NLU

    Authors: Johannes Mario Meissner, Saku Sugawara, Akiko Aizawa

    Abstract: Debiasing language models from unwanted behaviors in Natural Language Understanding tasks is a topic with rapidly increasing interest in the NLP community. Spurious statistical correlations in the data allow models to perform shortcuts and avoid uncovering more advanced and desirable linguistic features. A multitude of effective debiasing approaches has been proposed, but flexibility remains a maj…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022

  18. arXiv:2210.14541

    cs.CL

    Look to the Right: Mitigating Relative Position Bias in Extractive Question Answering

    Authors: Kazutoshi Shinoda, Saku Sugawara, Akiko Aizawa

    Abstract: Extractive question answering (QA) models tend to exploit spurious correlations to make predictions when a training set has unintended biases. This tendency results in models not being generalizable to examples where the correlations do not hold. Determining the spurious correlations QA models can exploit is crucial in building generalizable QA models in real-world applications; moreover, a method…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: Accepted to BlackboxNLP 2022

  19. arXiv:2210.05208

    cs.CL

    How Well Do Multi-hop Reading Comprehension Models Understand Date Information?

    Authors: Xanh Ho, Saku Sugawara, Akiko Aizawa

    Abstract: Several multi-hop reading comprehension datasets have been proposed to resolve the issue of reasoning shortcuts by which questions can be answered without performing multi-hop reasoning. However, the ability of multi-hop models to perform step-by-step reasoning when finding an answer to a comparison question remains unclear. It is also unclear how questions about the internal reasoning process are…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: 10 pages, 2 figures, and 8 tables; Accepted to AACL-IJCNLP 2022

  20. arXiv:2209.07760

    cs.CL cs.AI

    Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios

    Authors: Mana Ashida, Saku Sugawara

    Abstract: The possible consequences for the same context may vary depending on the situation we refer to. However, current studies in natural language processing do not focus on situated commonsense reasoning under multiple possible scenarios. This study frames this task by asking multiple questions with the same set of possible endings as candidate answers, given a short story text. Our resulting dataset,…

    Submitted 16 September, 2022; originally announced September 2022.

    Comments: Accepted to COLING 2022

  21. arXiv:2209.01824

    cs.CL

    A Survey on Measuring and Mitigating Reasoning Shortcuts in Machine Reading Comprehension

    Authors: Xanh Ho, Johannes Mario Meissner, Saku Sugawara, Akiko Aizawa

    Abstract: The issue of shortcut learning is widely known in NLP and has been an important research focus in recent years. Unintended correlations in the data enable models to easily solve tasks that were meant to exhibit advanced language understanding and reasoning capabilities. In this survey paper, we focus on the field of machine reading comprehension (MRC), an important task for showcasing high-level l…

    Submitted 6 September, 2023; v1 submitted 5 September, 2022; originally announced September 2022.

    Comments: 18 pages, 2 figures, 4 tables

  22. arXiv:2203.06342

    cs.CL cs.AI

    What Makes Reading Comprehension Questions Difficult?

    Authors: Saku Sugawara, Nikita Nangia, Alex Warstadt, Samuel R. Bowman

    Abstract: For a natural language understanding benchmark to be useful in research, it has to consist of examples that are diverse and difficult enough to discriminate among current and near-future state-of-the-art systems. However, we do not yet know how best to select text sources to collect a variety of challenging examples. In this study, we crowdsource multiple-choice reading comprehension questions for…

    Submitted 11 March, 2022; originally announced March 2022.

    Comments: ACL 2022

  23. arXiv:2109.11256

    cs.CL cs.AI

    Can Question Generation Debias Question Answering Models? A Case Study on Question-Context Lexical Overlap

    Authors: Kazutoshi Shinoda, Saku Sugawara, Akiko Aizawa

    Abstract: Question answering (QA) models for reading comprehension have been demonstrated to exploit unintended dataset biases such as question-context lexical overlap. This hinders QA models from generalizing to under-represented samples such as questions with low lexical overlap. Question generation (QG), a method for augmenting QA datasets, can be a solution for such performance degradation if QG can pro…

    Submitted 23 September, 2021; originally announced September 2021.

    Comments: MRQA workshop at EMNLP 2021

  24. arXiv:2106.03020

    cs.CL

    Embracing Ambiguity: Shifting the Training Target of NLI Models

    Authors: Johannes Mario Meissner, Napat Thumwanit, Saku Sugawara, Akiko Aizawa

    Abstract: Natural Language Inference (NLI) datasets contain examples with highly ambiguous labels. While many research works do not pay much attention to this fact, several recent efforts have been made to acknowledge and embrace the existence of ambiguity, such as UNLI and ChaosNLI. In this paper, we explore the option of training directly on the estimated label distribution of the annotators in the NLI ta…

    Submitted 5 June, 2021; originally announced June 2021.

    Comments: Accepted to ACL 2021

  25. arXiv:2106.00794

    cs.CL cs.AI cs.HC

    What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

    Authors: Nikita Nangia, Saku Sugawara, Harsh Trivedi, Alex Warstadt, Clara Vania, Samuel R. Bowman

    Abstract: Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used for collecting the datasets. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving…

    Submitted 1 June, 2021; originally announced June 2021.

    Comments: ACL 2021

  26. arXiv:2011.01060

    cs.CL

    Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps

    Authors: Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, Akiko Aizawa

    Abstract: A multi-hop question answering (QA) dataset aims to test reasoning and inference skills by requiring a model to read multiple paragraphs to answer a given question. However, current datasets do not provide a complete explanation for the reasoning process from the question to the answer. Further, previous studies revealed that many examples in existing multi-hop datasets do not require multi-hop re…

    Submitted 12 November, 2020; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: Accepted by COLING 2020

  27. arXiv:2004.03238

    cs.CL cs.AI cs.LG

    Improving the Robustness of QA Models to Challenge Sets with Variational Question-Answer Pair Generation

    Authors: Kazutoshi Shinoda, Saku Sugawara, Akiko Aizawa

    Abstract: Question answering (QA) models for reading comprehension have achieved human-level accuracy on in-distribution test sets. However, they have been demonstrated to lack robustness to challenge sets, whose distribution is different from that of training sets. Existing data augmentation methods mitigate this problem by simply augmenting training sets with synthetic examples sampled from the same distr…

    Submitted 3 June, 2021; v1 submitted 7 April, 2020; originally announced April 2020.

    Comments: ACL-IJCNLP 2021 SRW

  28. arXiv:2004.01912

    cs.CL

    Benchmarking Machine Reading Comprehension: A Psychological Perspective

    Authors: Saku Sugawara, Pontus Stenetorp, Akiko Aizawa

    Abstract: Machine reading comprehension (MRC) has received considerable attention as a benchmark for natural language understanding. However, the conventional task design of MRC lacks explainability beyond the model interpretation, i.e., reading comprehension by a model cannot be explained in human terms. To this end, this position paper provides a theoretical basis for the design of MRC datasets based on p…

    Submitted 26 January, 2021; v1 submitted 4 April, 2020; originally announced April 2020.

    Comments: 21 pages, EACL 2021

  29. arXiv:1911.09241

    cs.CL

    Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets

    Authors: Saku Sugawara, Pontus Stenetorp, Kentaro Inui, Akiko Aizawa

    Abstract: Existing analysis work in machine reading comprehension (MRC) is largely concerned with evaluating the capabilities of systems. However, the capabilities of datasets are not assessed for benchmarking language understanding precisely. We propose a semi-automated, ablation-based methodology for this challenge; by checking whether questions can be solved even after removing features associated with a…

    Submitted 20 November, 2019; originally announced November 2019.

    Comments: 11 pages, AAAI2020, with extra examples, data: https://github.com/Alab-NII/mrc-ablation

  30. arXiv:1808.09384

    cs.CL cs.AI

    What Makes Reading Comprehension Questions Easier?

    Authors: Saku Sugawara, Kentaro Inui, Satoshi Sekine, Akiko Aizawa

    Abstract: A challenge in creating a dataset for machine reading comprehension (MRC) is to collect questions that require a sophisticated understanding of language to answer beyond using superficial cues. In this work, we investigate what makes questions easier across 12 recent MRC datasets with three question styles (answer extraction, description, and multiple choice). We propose to employ simple heuristic…

    Submitted 28 August, 2018; originally announced August 2018.

    Comments: 12 pages, EMNLP2018