Search | arXiv e-print repository

Multi-Agent Reinforcement Learning-based Cooperative Autonomous Driving in Smart Intersections

Authors: Taoyuan Yu, Kui Wang, Zongdian Li, Tao Yu, Kei Sakaguchi

Abstract: Unsignalized intersections pose significant safety and efficiency challenges due to complex traffic flows. This paper proposes a novel roadside unit (RSU)-centric cooperative driving system leveraging global perception and vehicle-to-infrastructure (V2I) communication. The core of the system is an RSU-based decision-making module using a two-stage hybrid reinforcement learning (RL) framework. First, policies are pre-trained offline using conservative Q-learning (CQL) combined with behavior cloning (BC) on a collected dataset. Subsequently, these policies are fine-tuned in simulation using multi-agent proximal policy optimization (MAPPO), aligned with a self-attention mechanism to effectively solve inter-agent dependencies. RSUs perform real-time inference based on the trained models to realize vehicle control via V2I communications. Extensive experiments in the CARLA environment demonstrate the high effectiveness of the proposed system by (i) achieving failure rates below 0.03% in coordinating three connected and autonomous vehicles (CAVs) through complex intersection scenarios, significantly outperforming the traditional Autoware control method, and (ii) exhibiting strong robustness across varying numbers of controlled agents and showing promising generalization capabilities on other maps.

Submitted 7 May, 2025; originally announced May 2025.

Comments: 7 pages

arXiv:2504.20542 [pdf, other]

Digital Twin-Empowered Cooperative Autonomous Car-sharing Services: Proof-of-Concept

Authors: Kazuma Nonomura, Kui Wang, Zongdian Li, Tao Yu, Kei Sakaguchi

Abstract: This paper presents a digital twin-empowered real-time optimal delivery system specifically validated through a proof-of-concept (PoC) demonstration of a real-world autonomous car-sharing service. This study integrates real-time data from roadside units (RSUs) and connected and autonomous vehicles (CAVs) within a digital twin of a campus environment to address the dynamic challenges of urban traffic. The proposed system leverages the Age of Information (AoI) metric to optimize vehicle routing by maintaining data freshness and dynamically adapting to real-time traffic conditions. Experimental results from the PoC demonstrate a 22% improvement in delivery efficiency compared to conventional shortest-path methods that do not consider information freshness. Furthermore, digital twin-based simulation results demonstrate that this proposed system improves overall delivery efficiency by 12% and effectively reduces the peak average AoI by 23% compared to the conventional method, where each vehicle selects the shortest route without considering information freshness. This study confirms the practical feasibility of cooperative driving systems, highlighting their potential to enhance smart mobility solutions through scalable digital twin deployments in complex urban environments.

Submitted 29 April, 2025; originally announced April 2025.

Comments: The paper was accepted by the 36th IEEE Intelligent Vehicles Symposium (IEEE IV 2025)

arXiv:2503.23899 [pdf, other]

Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset

Authors: Diana Galvan-Sosa, Gabrielle Gaudeau, Pride Kavumba, Yunmeng Li, Hongyi Gu, Zheng Yuan, Keisuke Sakaguchi, Paula Buttery

Abstract: The performance and usability of Large-Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than cohesion and word choice. The full dataset, rubric, and code will be made available upon acceptance.

Submitted 31 March, 2025; originally announced March 2025.

Comments: 9 main pages (21 appendix pages), 7 figures, submitted to ACL 2025

ACM Class: I.2.7

arXiv:2503.03590 [pdf, other]

Digital Twin-Enabled Blockage-Aware Dynamic mmWave Multi-Hop V2X Communication

Authors: Supat Roongpraiwan, Zongdian Li, Tao Yu, Kei Sakaguchi

Abstract: Millimeter wave (mmWave) technology in vehicle-to-everything (V2X) communication offers unprecedented data rates and low latency, but faces significant reliability challenges due to signal blockages and limited range. This paper introduces a novel system for managing dynamic multi-hop mmWave V2X communications in complex blocking environments. We present a system architecture that integrates a mobility digital twin (DT) with the multi-hop routing control plane, providing a comprehensive, real-time view of the network and its surrounding traffic environment. This integration enables the control plane to make informed routing decisions based on rich contextual data about vehicles, infrastructure, and potential signal blockages. Leveraging this DT-enhanced architecture, we propose an advanced routing algorithm that combines high-precision environmental data with trajectory prediction to achieve blockage-aware mmWave multi-hop V2X routing. Our algorithm anticipates network topology changes and adapts topology dynamically to maintain reliable connections. We evaluate our approach through proof-of-concept simulations using a mobility DT of the Nishishinjuku area. Results demonstrate that our DT-enabled routing strategy significantly outperforms conventional methods in maintaining reliable mmWave V2X connections across various traffic scenarios, including fully connected and mixed traffic environments.

Submitted 17 March, 2025; v1 submitted 5 March, 2025; originally announced March 2025.

arXiv:2502.00344 [pdf, other]

FinchGPT: a Transformer based language model for birdsong analysis

Authors: Kosei Kobayashi, Kosuke Matsuzaki, Masaya Taniguchi, Keisuke Sakaguchi, Kentaro Inui, Kentaro Abe

Abstract: The long-range dependencies among the tokens, which originate from hierarchical structures, are a defining hallmark of human language. However, whether similar dependencies exist within the sequential vocalization of non-human animals remains a topic of investigation. Transformer architectures, known for their ability to model long-range dependencies among tokens, provide a powerful tool for investigating this phenomenon. In this study, we employed the Transformer architecture to analyze the songs of Bengalese finch (Lonchura striata domestica), which are characterized by their highly variable and complex syllable sequences. To this end, we developed FinchGPT, a Transformer-based model trained on a textualized corpus of birdsongs, which outperformed other architecture models in this domain. Attention weight analysis revealed that FinchGPT effectively captures long-range dependencies within syllables sequences. Furthermore, reverse engineering approaches demonstrated the impact of computational and biological manipulations on its performance: restricting FinchGPT's attention span and disrupting birdsong syntax through the ablation of specific brain nuclei markedly influenced the model's outputs. Our study highlights the transformative potential of large language models (LLMs) in deciphering the complexities of animal vocalizations, offering a novel framework for exploring the structural properties of non-human communication systems while shedding light on the computational distinctions between biological brains and artificial neural networks.

Submitted 1 February, 2025; originally announced February 2025.

Comments: 12 pages, 4 figures

arXiv:2501.15754 [pdf, other]

Weight-based Analysis of Detokenization in Language Models: Understanding the First Stage of Inference Without Inference

Authors: Go Kamoda, Benjamin Heinzerling, Tatsuro Inaba, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

Abstract: According to the stages-of-inference hypothesis, early layers of language models map their subword-tokenized input, which does not necessarily correspond to a linguistically meaningful segmentation, to more meaningful representations that form the model's "inner vocabulary". Prior analysis of this detokenization stage has predominantly relied on probing and interventions such as path patching, which involve selecting particular inputs, choosing a subset of components that will be patched, and then observing changes in model behavior. Here, we show that several important aspects of the detokenization stage can be understood purely by analyzing model weights, without performing any model inference steps. Specifically, we introduce an analytical decomposition of first-layer attention in GPT-2. Our decomposition yields interpretable terms that quantify the relative contributions of position-related, token-related, and mixed effects. By focusing on terms in this decomposition, we discover weight-based explanations of attention bias toward close tokens and attention for detokenization.

Submitted 10 February, 2025; v1 submitted 26 January, 2025; originally announced January 2025.

Comments: 22 pages, 14 figures, to appear in NAACL Findings 2025

arXiv:2412.01113 [pdf, other]

Think-to-Talk or Talk-to-Think? When LLMs Come Up with an Answer in Multi-Step Arithmetic Reasoning

Authors: Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Ana Brassard, Keisuke Sakaguchi, Kentaro Inui

Abstract: This study investigates the internal reasoning process of language models during arithmetic multi-step reasoning, motivated by the question of when they internally form their answers during reasoning. Particularly, we inspect whether the answer is determined before or after chain-of-thought (CoT) begins to determine whether models follow a post-hoc Think-to-Talk mode or a step-by-step Talk-to-Think mode of explanation. Through causal probing experiments in controlled arithmetic reasoning tasks, we found systematic internal reasoning patterns across models in our case study; for example, single-step subproblems are solved before CoT begins, and more complicated multi-step calculations are performed during CoT.

Submitted 17 April, 2025; v1 submitted 1 December, 2024; originally announced December 2024.

arXiv:2411.06387 [pdf, other]

Self-Training Meets Consistency: Improving LLMs' Reasoning with Consistency-Driven Rationale Evaluation

Authors: Jaehyeok Lee, Keisuke Sakaguchi, JinYeong Bak

Abstract: Self-training approach for large language models (LLMs) improves reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, a single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also improves reasoning abilities compared to previous self-training approaches.

Submitted 6 February, 2025; v1 submitted 10 November, 2024; originally announced November 2024.

Comments: Accepted to NAACL 2025

arXiv:2410.12163 [pdf, other]

Augmented Intelligence in Smart Intersections: Local Digital Twins-Assisted Hybrid Autonomous Driving

Authors: Kui Wang, Kazuma Nonomura, Zongdian Li, Tao Yu, Kei Sakaguchi, Omar Hashash, Walid Saad, Changyang She, Yonghui Li

Abstract: Vehicle-road collaboration is a promising approach for enhancing the safety and efficiency of autonomous driving by extending the intelligence of onboard systems to smart roadside infrastructures. The introduction of digital twins (DTs), particularly local DTs (LDTs) at the edge, in smart mobility presents a new embodiment of augmented intelligence, which could enhance information exchange and extract human driving expertise to improve onboard intelligence. This paper presents a novel LDT-assisted hybrid autonomous driving system for improving safety and efficiency in traffic intersections. By leveraging roadside units (RSUs) equipped with sensory and computing capabilities, the proposed system continuously monitors traffic, extracts human driving knowledge, and generates intersection-specific local driving agents through an offline reinforcement learning (RL) framework. When connected and automated vehicles (CAVs) pass through RSU-equipped intersections, RSUs can provide local agents to support safe and efficient driving in local areas. Meanwhile, they provide real-time cooperative perception (CP) to broaden onboard sensory horizons. The proposed LDT-assisted hybrid system is implemented with state-of-the-art products, e.g., CAVs and RSUs, and technologies, e.g., millimeter-wave (mmWave) communications. Hardware-in-the-loop (HiL) simulations and proof-of-concept (PoC) tests validate system performance from two standpoints: (i) The peak latency for CP and local agent downloading are 8.51 ms and 146 ms, respectively, aligning with 3GPP requirements for vehicle-to-everything (V2X) and model transfer use cases. Moreover, (ii) local driving agents can improve safety measures by 10% and reduce travel time by 15% compared with conventional onboard systems. The implemented prototype also demonstrates reliable real-time performance, fulfilling the targets of the proposed system design.

Submitted 18 October, 2024; v1 submitted 15 October, 2024; originally announced October 2024.

Comments: 14 pages, 9 figures

arXiv:2409.07232 [pdf, other]

Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee

Authors: Chayanon Wichitrnithed, Woo-Sun Yang, Yun He, Brad Richardson, Koichi Sakaguchi, Manuel Arenaz, William I. Gustafson Jr., Jacob Shpund, Ulises Costi Blanco, Alvaro Goldar Dieste

Abstract: Currently, the Weather Research and Forecasting model (WRF) utilizes shared memory (OpenMP) and distributed memory (MPI) parallelisms. To take advantage of GPU resources on the Perlmutter supercomputer at NERSC, we port parts of the computationally expensive routines of the Fast Spectral Bin Microphysics (FSBM) microphysical scheme to NVIDIA GPUs using OpenMP device offloading directives. To facilitate this process, we explore a workflow for optimization which uses both runtime profilers and a static code inspection tool Codee to refactor the subroutine. We observe a 2.08x overall speedup for the CONUS-12km thunderstorm test case.

Submitted 11 September, 2024; originally announced September 2024.

arXiv:2409.00040 [pdf, other]

Digital Twin-Empowered Routing Management for Reliable Multi-Hop Millimeter Wave V2X

Authors: Supat Roongpraiwan, Zongdian Li, Tao Yu, Kei Sakaguchi

Abstract: Digital twin (DT) technology can replicate physical entities in cyberspace. A mobility DT digitalizes connected and autonomous vehicles (CAVs) and their surrounding traffic environment, allowing to monitor the maneuvering and distribution of CAVs in real-time, which is crucial for managing vehicle-to-everything (V2X) connectivity, especially when millimeter wave (mmWave) is adopted. MmWave V2X relies on dynamic multi-hop communications to ensure high reliability. Therefore, in this paper, the challenges of mmWave V2X are presented to motivate the utilization of DT, and then we introduce the system model for DT-based multi-hop routing management, incorporating two different routing algorithms: with and without future trajectory prediction. For proof of concept, we implement the proposed DT system using Unity-based AWSIM and evaluate the proposed algorithms via simulations. The results show that, compared to the conventional routing algorithm in vehicular ad hoc networks (VANETs), the DT-based algorithms significantly improve the reliability of mmWave V2X, and such improvements can be seen in both fully connected and mixed traffic scenarios.

Submitted 18 August, 2024; originally announced September 2024.

arXiv:2408.03554 [pdf, other]

Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection

Authors: Subaru Kimura, Ryota Tanaka, Shumpei Miyawaki, Jun Suzuki, Keisuke Sakaguchi

Abstract: We explore visual prompt injection (VPI) that maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, "goal hijacking via visual prompt injection" (GHVPI), that swaps the execution task of LVLMs from an original task to an alternative task designated by an attacker. The quantitative analysis indicates that GPT-4V is vulnerable to the GHVPI and demonstrates a notable attack success rate of 15.8%, which is an unignorable security risk. Our analysis also shows that successful GHVPI requires high character recognition capability and instruction-following ability in LVLMs.

Submitted 7 August, 2024; originally announced August 2024.

Comments: 8 pages, 6 figures, Accepted to NAACL 2024 SRW

arXiv:2407.03963 [pdf, other]

LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

Authors: LLM-jp: Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto, Takuya Fukushima, Namgi Han, Yuto Harada, Chikara Hashimoto, Tatsuya Hiraoka, Shohei Hisada, Sosuke Hosokawa, Lu Jie, Keisuke Kamata, Teruhito Kanazawa, Hiroki Kanezashi, Hiroshi Kataoka, Satoru Katsumata, Daisuke Kawahara, Seiya Kawano, et al. (58 additional authors not shown)

Abstract: This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.

Submitted 30 December, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.16078 [pdf, other]

First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning

Authors: Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Keisuke Sakaguchi, Kentaro Inui

Abstract: Multi-step reasoning instruction, such as chain-of-thought prompting, is widely adopted to explore better language models (LMs) performance. We report on the systematic strategy that LMs employ in such a multi-step reasoning process. Our controlled experiments reveal that LMs rely more heavily on heuristics, such as lexical overlap, in the earlier stages of reasoning, where more reasoning steps remain to reach a goal. Conversely, their reliance on heuristics decreases as LMs progress closer to the final answer through multiple reasoning steps. This suggests that LMs can backtrack only a limited number of future steps and dynamically combine heuristic strategies with rationale ones in tasks involving multi-step reasoning.

Submitted 7 October, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

Comments: This paper is accepted at EMNLP 2024

arXiv:2406.06032 [pdf, other]

The Curse of Popularity: Popular Entities have Catastrophic Side Effects when Deleting Knowledge from Language Models

Authors: Ryosuke Takahashi, Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui

Abstract: Language models (LMs) encode world knowledge in their internal parameters through training. However, LMs may learn personal and confidential information from the training data, leading to privacy concerns such as data leakage. Therefore, research on knowledge deletion from LMs is essential. This study focuses on the knowledge stored in LMs and analyzes the relationship between the side effects of knowledge deletion and the entities related to the knowledge. Our findings reveal that deleting knowledge related to popular entities can have catastrophic side effects. Furthermore, this research is the first to analyze knowledge deletion in models trained on synthetic knowledge graphs, indicating a new direction for controlled experiments.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2405.04818 [pdf, other]

ACORN: Aspect-wise Commonsense Reasoning Explanation Evaluation

Authors: Ana Brassard, Benjamin Heinzerling, Keito Kudo, Keisuke Sakaguchi, Kentaro Inui

Abstract: Evaluating the quality of free-text explanations is a multifaceted, subjective, and labor-intensive task. Large language models (LLMs) present an appealing alternative due to their potential for consistency, scalability, and cost-efficiency. In this work, we present ACORN, a new dataset of 3,500 free-text explanations and aspect-wise quality ratings, and use it to evaluate how LLMs rate explanations. We observed that larger models outputted labels that maintained or increased the inter-annotator agreement, suggesting that they are within the expected variance between human raters. However, their correlation with majority-voted human ratings varied across different quality aspects, indicating that they are not a complete replacement. In turn, using LLMs as a supplement to a smaller group of human raters in some cases improved the correlation with the original majority labels. However, the effect was limited to cases where human raters were scarce, and an additional human rater had a more pronounced effect in all cases. Overall, we recommend against using LLMs as a complete replacement for human raters but encourage using them in configurations that end with targeted human involvement. Data available here: https://github.com/a-brassard/ACORN

Submitted 1 September, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

Comments: 18 pages, 7 figures, accepted to COLM 2024. Data available here: https://github.com/a-brassard/ACORN

arXiv:2405.03935 [pdf, other]

Roadside Units Assisted Localized Automated Vehicle Maneuvering: An Offline Reinforcement Learning Approach

Authors: Kui Wang, Changyang She, Zongdian Li, Tao Yu, Yonghui Li, Kei Sakaguchi

Abstract: Traffic intersections present significant challenges for the safe and efficient maneuvering of connected and automated vehicles (CAVs). This research proposes an innovative roadside unit (RSU)-assisted cooperative maneuvering system aimed at enhancing road safety and traveling efficiency at intersections for CAVs. We utilize RSUs for real-time traffic data acquisition and train an offline reinforcement learning (RL) algorithm based on human driving data. Evaluation results obtained from hardware-in-loop autonomous driving simulations show that our approach employing the twin delayed deep deterministic policy gradient and behavior cloning (TD3+BC), achieves performance comparable to state-of-the-art autonomous driving systems in terms of safety measures while significantly enhancing travel efficiency by up to 17.38% in intersection areas. This paper makes a pivotal contribution to the field of intelligent transportation systems, presenting a breakthrough solution for improving urban traffic flow and safety at intersections.

Submitted 17 September, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

Comments: 6 pages, 6 figures

arXiv:2403.08173 [pdf, other]

A bargain for mergesorts (functional pearl) -- How to prove your mergesort correct and stable, almost for free

Authors: Cyril Cohen, Kazuhiko Sakaguchi

Abstract: We present a novel characterization of stable mergesort functions using relational parametricity, and show that it implies the correctness of mergesort. As a result, one can prove the correctness of several variations of mergesort (e.g., top-down, bottom-up, tail-recursive, non-tail-recursive, smooth, and non-smooth mergesorts) by proving the characterization property for each variation. To further motivate this work, we show a performance trade-off between tail-recursive and non-tail-recursive mergesorts that (1) the former in call-by-value evaluation avoids using up stack space and is efficient and (2) the latter in call-by-need evaluation is an optimal incremental sort, meaning that it performs only $\mathcal{O}(n + k \log k)$ comparisons to compute the least (or greatest) $k$ items of a list of length $n$. Thanks to our characterization and the parametricity translation, we deduced the correctness results, including stability, of various implementations of mergesort for lists, including highly optimized ones, in the Coq proof assistant.

Submitted 12 March, 2024; originally announced March 2024.

Comments: The supplementary material is available at https://github.com/pi8027/stablesort
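
As a rough illustration of the incremental-sort behavior the abstract above describes, the following Python sketch builds a stable bottom-up mergesort out of lazy generators. It is not the authors' Coq development: the names `lazy_merge` and `lazy_mergesort` are hypothetical, and Python generators merely stand in for call-by-need evaluation. Comparisons happen only when an element is actually requested, so taking the first few items avoids sorting the whole list.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
_END = object()  # sentinel marking an exhausted iterator


def lazy_merge(xs: Iterator[T], ys: Iterator[T],
               leq: Callable[[T, T], bool]) -> Iterator[T]:
    """Lazily merge two sorted iterators; ties are taken from xs (stability)."""
    x, y = next(xs, _END), next(ys, _END)
    while x is not _END and y is not _END:
        if leq(x, y):  # on a tie the left element goes first
            yield x
            x = next(xs, _END)
        else:
            yield y
            y = next(ys, _END)
    while x is not _END:  # drain whatever is left on the left
        yield x
        x = next(xs, _END)
    while y is not _END:  # drain whatever is left on the right
        yield y
        y = next(ys, _END)


def lazy_mergesort(items: Iterable[T],
                   leq: Callable[[T, T], bool]) -> Iterator[T]:
    """Bottom-up mergesort that performs comparisons only on demand."""
    runs = [iter([item]) for item in items]  # one singleton run per item
    if not runs:
        return iter(())
    while len(runs) > 1:
        # Merge adjacent runs pairwise; an odd run at the end is carried over.
        runs = [
            lazy_merge(runs[i], runs[i + 1], leq) if i + 1 < len(runs) else runs[i]
            for i in range(0, len(runs), 2)
        ]
    return runs[0]


if __name__ == "__main__":
    records = [(2, "x"), (1, "y"), (2, "z"), (1, "w")]
    by_key = lambda a, b: a[0] <= b[0]
    # Only the two least records are computed; equal keys keep input order.
    print(list(islice(lazy_mergesort(records, by_key), 2)))  # [(1, 'y'), (1, 'w')]
```

Passing the ordering as a `leq` function rather than relying on `<` makes it easy to count comparisons or to sort by a key while observing stability on the remaining fields.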

arXiv:2402.14411 [pdf, other]

J-UniMorph: Japanese Morphological Annotation through the Universal Feature Schema

Authors: Kosuke Matsuzaki, Masaya Taniguchi, Kentaro Inui, Keisuke Sakaguchi

Abstract: We introduce a Japanese Morphology dataset, J-UniMorph, developed based on the UniMorph feature schema. This dataset addresses the unique and rich verb forms characteristic of the language's agglutinative nature. J-UniMorph distinguishes itself from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. On average, the Wiktionary Edition features around 12 inflected forms for each word and is primarily dominated by denominal verbs (i.e., [noun] +suru (do-PRS)). Morphologically, this form is equivalent to the verb suru (do). In contrast, J-UniMorph explores a much broader and more frequently used range of verb forms, offering 118 inflected forms for each word on average. It includes honorifics, a range of politeness levels, and other linguistic nuances, emphasizing the distinctive characteristics of the Japanese language. This paper presents detailed statistics and characteristics of J-UniMorph, comparing it with the Wiktionary Edition. We make J-UniMorph and its interactive visualizer publicly available, aiming to support cross-linguistic research and various applications.

Submitted 22 February, 2024; originally announced February 2024.

Comments: 14 pages, 4 figures

arXiv:2402.12682 [pdf, other]

doi 10.1109/TIV.2024.3368109

Smart Mobility Digital Twin Based Automated Vehicle Navigation System: A Proof of Concept

Authors: Kui Wang, Zongdian Li, Kazuma Nonomura, Tao Yu, Kei Sakaguchi, Omar Hashash, Walid Saad

Abstract: Digital twins (DTs) have driven major advancements across various industrial domains over the past two decades. With the rapid advancements in autonomous driving and vehicle-to-everything (V2X) technologies, integrating DTs into vehicular platforms is anticipated to further revolutionize smart mobility systems. In this paper, a new smart mobility DT (SMDT) platform is proposed for the control of connected and automated vehicles (CAVs) over next-generation wireless networks. In particular, the proposed platform enables cloud services to leverage the abilities of DTs to promote the autonomous driving experience. To enhance traffic efficiency and road safety measures, a novel navigation system that exploits available DT information is designed. The SMDT platform and navigation system are implemented with state-of-the-art products, e.g., CAVs and roadside units (RSUs), and emerging technologies, e.g., cloud and cellular V2X (C-V2X). In addition, proof-of-concept (PoC) experiments are conducted to validate system performance. The performance of SMDT is evaluated from two standpoints: (i) the rewards of the proposed navigation system on traffic efficiency and safety, and (ii) the latency and reliability of the SMDT platform. Our experimental results using SUMO-based large-scale traffic simulations show that the proposed SMDT can reduce the average travel time and the blocking probability due to unexpected traffic incidents. Furthermore, the results record a peak overall latency for DT modeling and route planning services of 155.15 ms and 810.59 ms, respectively, which validates that our proposed design aligns with the 3GPP requirements for emerging V2X use cases and fulfills the targets of the proposed design. Our demonstration video can be found at https://youtu.be/3waQwlaHQkk.

Submitted 19 February, 2024; originally announced February 2024.

Comments: 15 pages, 10 figures

arXiv:2401.08654 [pdf, other]

Smart Mobility Digital Twin for Automated Driving: Design and Proof-of-Concept

Authors: Kui Wang, Zongdian Li, Tao Yu, Kei Sakaguchi

Abstract: During the past decade, smart mobility and intelligent vehicles have attracted increasing attention, because they promise to create a highly efficient and safe transportation system in the future. Meanwhile, digital twin, as an emerging technology, will play an important role in automated driving and intelligent transportation systems. This technology is applied in this paper to design a platform for smart mobility, providing large-scale route planning services. Utilizing sensing technologies and cloud/edge computing, we build a digital twin system model that reflects the static and dynamic objects from the real world in real time. With the smart mobility platform, we realize traffic monitoring and route planning through cooperative environment perception to help automated vehicles circumvent jams. A proof-of-concept test with a real vehicle in real traffic is conducted to validate the functions and the delay performance of the proposed platform.

Submitted 24 December, 2023; originally announced January 2024.

arXiv:2401.08653 [pdf, other]

Digital Twins for Autonomous Driving: A Comprehensive Implementation and Demonstration

Authors: Kui Wang, Tao Yu, Zongdian Li, Kei Sakaguchi, Omar Hashash, Walid Saad

Abstract: The concept of a digital twin (DT) plays a pivotal role in the ongoing digital transformation and has achieved significant strides for various wireless applications in recent years. In particular, the field of autonomous vehicles is a domain that is ripe for exploiting the concept of DT. Nevertheless, there are many challenges that include holistic consideration and integration of hardware, software, communication methods, and collaboration of edge/cloud computing. In this paper, an end-to-end (E2E) real-world smart mobility DT is designed and implemented for the purpose of autonomous driving. The proposed system utilizes roadside units (RSUs) and edge computing to capture real-world traffic information, which is then processed in the cloud to create a DT model. This DT model is then exploited to enable route planning services for the autonomous vehicle to avoid heavy traffic. Real-world experimental results show that the system reliability can reach 99.53% while achieving a latency that is 3.36% below the 3GPP recommended value of 100 ms for autonomous driving. These results clearly validate the effectiveness of the system according to practical 3GPP standards for sensor and state map sharing (SSMS) and information sharing.

Submitted 24 December, 2023; originally announced January 2024.

Comments: 7 pages, 8 figures
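
Taking the figures in the abstract above at face value, the reported latency can be worked out directly. This is only arithmetic on the two quoted numbers (3.36% and 100 ms), not a value stated in the listing itself:

```latex
\[
  100\,\text{ms} \times (1 - 0.0336) = 96.64\,\text{ms}
\]
```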

arXiv:2312.06432 [pdf, other]

doi 10.1109/IOTM.001.2300279

Internet of Federated Digital Twins (IoFDT): Connecting Twins Beyond Borders for Society 5.0

Authors: Tao Yu, Zongdian Li, Kei Sakaguchi, Omar Hashash, Walid Saad, Merouane Debbah

Abstract: The concept of digital twin (DT), which enables the creation of a programmable, digital representation of physical systems, is expected to revolutionize future industries and will lie at the heart of the vision of a future smart society, namely, Society 5.0, in which high integration between cyber (digital) and physical spaces is exploited to bring economic and societal advancements. However, the success of such a DT-driven Society 5.0 requires a synergistic convergence of artificial intelligence and networking technologies into an integrated, programmable system that can coordinate DT networks to effectively deliver diverse Society 5.0 services. Prior works remain restricted to either qualitative study, simple analysis or software implementations of a single DT, and thus, they cannot provide the highly synergistic integration of digital and physical spaces as required by Society 5.0. In contrast, this paper envisions a novel concept of an Internet of Federated Digital Twins (IoFDT) that holistically integrates heterogeneous and physically separated DTs representing different Society 5.0 services within a single framework and system. For this concept of IoFDT, we first introduce a hierarchical architecture that integrates federated DTs through horizontal and vertical interactions, bridging cyber and physical spaces to unlock new possibilities. Then, we discuss challenges of realizing IoFDT, highlighting the intricacies across communication, computing, and AI-native networks while also underscoring potential innovative solutions. Subsequently, we elaborate on the importance of the implementation of a unified IoFDT platform that integrates all technical components and orchestrates their interactions, emphasizing the necessity of practical experimental platforms with a focus on real-world applications in areas like smart mobility.

Submitted 27 October, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

Journal ref: IEEE Internet of Things Magazine, vol.7, no.5, pp.64-71, Sept. 2024

arXiv:2312.00334 [pdf, other]

UAV-Aided Lifelong Learning for AoI and Energy Optimization in Non-Stationary IoT Networks

Authors: Zhenzhen Gong, Omar Hashash, Yingze Wang, Qimei Cui, Wei Ni, Walid Saad, Kei Sakaguchi

Abstract: In this paper, a novel joint energy and age of information (AoI) optimization framework for IoT devices in a non-stationary environment is presented. In particular, IoT devices that are distributed in the real world are required to efficiently utilize their computing resources so as to balance the freshness of their data and their energy consumption. To optimize the performance of IoT devices in such a dynamic setting, a novel lifelong reinforcement learning (RL) solution that enables IoT devices to continuously adapt their policies to each newly encountered environment is proposed. Given that IoT devices have limited energy and computing resources, an unmanned aerial vehicle (UAV) is leveraged to visit the IoT devices and update the policy of each device sequentially. As such, the UAV is exploited as a mobile learning agent that can learn a shared knowledge base with a feature base in its training phase, and feature sets of a zero-shot learning method in its testing phase, to generalize between the environments. To optimize the trajectory and flying velocity of the UAV, an actor-critic network is leveraged so as to minimize the UAV energy consumption. Simulation results show that the proposed lifelong RL solution can outperform the state-of-the-art benchmarks by enhancing the balanced cost of IoT devices by $8.3\%$ when incorporating warm-start policies for unseen environments. In addition, our solution achieves up to $49.38\%$ reduction in terms of energy consumption by the UAV in comparison to the random flying strategy.

Submitted 30 November, 2023; originally announced December 2023.

Comments: 15 pages, 14 figures

arXiv:2310.17121 [pdf, other]

Test-time Augmentation for Factual Probing

Authors: Go Kamoda, Benjamin Heinzerling, Keisuke Sakaguchi, Kentaro Inui

Abstract: Factual probing is a method that uses prompts to test if a language model "knows" certain world knowledge facts. A problem in factual probing is that small changes to the prompt can lead to large changes in model output. Previous work aimed to alleviate this problem by optimizing prompts via text mining or fine-tuning. However, such approaches are relation-specific and do not generalize to unseen relation types. Here, we propose to use test-time augmentation (TTA) as a relation-agnostic method for reducing sensitivity to prompt variations by automatically augmenting and ensembling prompts at test time. Experiments show improved model calibration, i.e., with TTA, model confidence better reflects prediction accuracy. Improvements in prediction accuracy are observed for some models, but for other models, TTA leads to degradation. Error analysis identifies the difficulty of producing high-quality prompt variations as the main challenge for TTA.

Submitted 25 October, 2023; originally announced October 2023.

Comments: 12 pages, 4 figures, accepted to EMNLP 2023 Findings (short paper)

arXiv:2305.19472 [pdf, other]

PlaSma: Making Small Language Models Better Procedural Knowledge Models for (Counterfactual) Planning

Authors: Faeze Brahman, Chandra Bhagavatula, Valentina Pyatkin, Jena D. Hwang, Xiang Lorraine Li, Hirona J. Arai, Soumya Sanyal, Keisuke Sakaguchi, Xiang Ren, Yejin Choi

Abstract: Procedural planning, which entails decomposing a high-level goal into a sequence of temporally ordered steps, is an important yet intricate task for machines. It involves integrating common-sense knowledge to reason about complex and often contextualized situations, e.g. "scheduling a doctor's appointment without a phone". While current approaches show encouraging results using large language models (LLMs), they are hindered by drawbacks such as costly API calls and reproducibility issues. In this paper, we advocate planning using smaller language models. We present PlaSma, a novel two-pronged approach to endow small language models with procedural knowledge and (constrained) language planning capabilities. More concretely, we develop symbolic procedural knowledge distillation to enhance the commonsense knowledge in small language models and an inference-time algorithm to facilitate more structured and accurate reasoning. In addition, we introduce a new related task, Replanning, that requires a revision of a plan to cope with a constrained situation. In both the planning and replanning settings, we show that orders-of-magnitude smaller models (770M-11B parameters) can compete with and often surpass their larger teacher models' capabilities. Finally, we showcase successful application of PlaSma in an embodied environment, VirtualHome.

Submitted 18 September, 2024; v1 submitted 30 May, 2023; originally announced May 2023.

Comments: ICLR 2024 version, 31 pages

arXiv:2304.10282 [pdf, other]

The Seven Worlds and Experiences of the Wireless Metaverse: Challenges and Opportunities

Authors: Omar Hashash, Christina Chaccour, Walid Saad, Tao Yu, Kei Sakaguchi, Merouane Debbah

Abstract: The wireless metaverse will create diverse user experiences at the intersection of the physical, digital, and virtual worlds. These experiences will enable novel interactions between the constituents (e.g., extended reality (XR) users and avatars) of the three worlds. However, remarkably, to date, there is no holistic vision that identifies the full set of metaverse worlds, constituents, and experiences, and the implications of their associated interactions on next-generation communication and computing systems. In this paper, we present a holistic vision of a limitless, wireless metaverse that distills the metaverse into an intersection of seven worlds and experiences that include the: i) physical, digital, and virtual worlds, along with the ii) cyber, extended, live, and parallel experiences. We then articulate how these experiences bring forth interactions between diverse metaverse constituents, namely, a) humans and avatars and b) connected intelligence systems and their digital twins (DTs). Then, we explore the wireless, computing, and artificial intelligence (AI) challenges that must be addressed to establish metaverse-ready networks that support these experiences and interactions. We particularly highlight the need for end-to-end synchronization of DTs, and the role of human-level AI and reasoning abilities for cognitive avatars. Moreover, we articulate a series of open questions that should ignite the quest for the future metaverse. We conclude with a set of recommendations to deploy the limitless metaverse over future wireless systems.

Submitted 20 April, 2023; originally announced April 2023.

arXiv:2303.18027 [pdf, other]

Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

Authors: Jungo Kasai, Yuhei Kasai, Keisuke Sakaguchi, Yutaro Yamada, Dragomir Radev

Abstract: As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years, including the current year. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all six years of the exams, highlighting LLMs' potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Further, our analysis shows that the API costs are generally higher and the maximum context size is smaller for Japanese because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark as Igaku QA as well as all model outputs and exam metadata. We hope that our results and benchmark will spur progress on more diverse applications of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.

Submitted 5 April, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

Comments: Added results from the March 2023 exam

arXiv:2303.15381 [pdf, other]

Causal schema induction for knowledge discovery

Authors: Michael Regan, Jena D. Hwang, Keisuke Sakaguchi, James Pustejovsky

Abstract: Making sense of familiar yet new situations typically involves making generalizations about causal schemas, stories that help humans reason about event sequences. Reasoning about events includes identifying cause and effect relations shared across event instances, a process we refer to as causal schema induction. Statistical schema induction systems may leverage structural knowledge encoded in discourse or the causal graphs associated with event meaning; however, resources to study such causal structure are few in number and limited in size. In this work, we investigate how to apply schema induction models to the task of knowledge discovery for enhanced search of English-language news texts. To tackle the problem of data scarcity, we present Torquestra, a manually curated dataset of text-graph-schema units integrating temporal, event, and causal structures. We benchmark our dataset on three knowledge discovery tasks, building and evaluating models for each. Results show that systems that harness causal structure are effective at identifying texts sharing similar causal meaning components rather than relying on lexical cues alone. We make our dataset and models available for research purposes.

Submitted 27 March, 2023; originally announced March 2023.

Comments: 8 pages, appendix

arXiv:2303.14342 [pdf, other]

Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction

Authors: Steven Coyne, Keisuke Sakaguchi, Diana Galvan-Sosa, Michael Zock, Kentaro Inui

Abstract: GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks. However, there is a relative lack of detailed published analysis of their performance on the task of grammatical error correction (GEC). To address this, we perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC benchmarks. We compare the performance of different prompts in both zero-shot and few-shot settings, analyzing intriguing or problematic outputs encountered with different prompt formats. We report the performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that the GPT models can perform well in a sentence-level revision setting, with GPT-4 achieving a new high score on the JFLEG benchmark. Through human evaluation experiments, we compare the GPT models' corrections to source, human reference, and baseline GEC system sentences and observe differences in editing strategies and how they are scored by human raters.

Submitted 30 May, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

arXiv:2302.08148 [pdf, other]

Empirical Investigation of Neural Symbolic Reasoning Strategies

Authors: Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Abstract: Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear. Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategies significantly affects the performance, with the gap becoming even larger as the extrapolation length becomes longer. Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation. Our results indicate the importance of further exploring effective strategies for neural reasoning models.

Submitted 16 February, 2023; originally announced February 2023.

Comments: This paper was accepted to Findings of EACL 2023; an earlier (non-archival) version of this work received the Best Paper Award at the Student Research Workshop of AACL 2022

arXiv:2302.07866 [pdf, other]

Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?

Authors: Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Ana Brassard, Masashi Yoshikawa, Keisuke Sakaguchi, Kentaro Inui

Abstract: Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps.

Submitted 15 February, 2023; originally announced February 2023.

Comments: accepted by EACL 2023

arXiv:2212.09246 [pdf, other]

I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation

Authors: Chandra Bhagavatula, Jena D. Hwang, Doug Downey, Ronan Le Bras, Ximing Lu, Lianhui Qin, Keisuke Sakaguchi, Swabha Swayamdipta, Peter West, Yejin Choi

Abstract: Commonsense capabilities of pre-trained language models dramatically improve with scale, leading many to believe that scale is the only winning recipe. But is it? Here, we investigate an alternative that a priori seems impossible: can smaller language models (e.g., GPT-2) win over models that are orders of magnitude larger and better (e.g., GPT-3), if powered with novel commonsense distillation algorithms? The key intellectual challenge is to design a learning algorithm that achieves a competitive level of commonsense acquisition, without relying on the benefits of scale. In particular, we study generative models of commonsense knowledge, focusing on the task of generating generics, statements of commonsense facts about everyday concepts, e.g., birds can fly. We introduce I2D2, a novel commonsense distillation framework that loosely follows the Symbolic Knowledge Distillation of West et al. but breaks the dependence on the extreme-scale teacher model with two innovations: (1) the novel adaptation of NeuroLogic Decoding to enhance the generation quality of the weak, off-the-shelf language models, and (2) self-imitation learning to iteratively learn from the model's own enhanced commonsense acquisition capabilities. Empirical results suggest that scale is not the only way, as novel algorithms can be a promising alternative. Moreover, our study leads to a new corpus of generics, Gen-A-tomic, that is the largest and highest quality available to date.

Submitted 26 May, 2023; v1 submitted 18 December, 2022; originally announced December 2022.

Comments: ACL 2023

arXiv:2211.14686 [pdf, other]

Towards a Decentralized Metaverse: Synchronized Orchestration of Digital Twins and Sub-Metaverses

Authors: Omar Hashash, Christina Chaccour, Walid Saad, Kei Sakaguchi, Tao Yu

Abstract: Accommodating digital twins (DTs) in the metaverse is essential to achieving digital reality. This need for integrating DTs into the metaverse while operating them at the network edge has increased the demand for a decentralized edge-enabled metaverse. Hence, to consolidate the fusion between real and digital entities, it is necessary to harmonize the interoperability between DTs and the metaverse at the edge. In this paper, a novel decentralized metaverse framework that incorporates DT operations at the wireless edge is presented. In particular, a system of autonomous physical twins (PTs) operating in a massively-sensed zone is replicated as cyber twins (CTs) at the mobile edge computing (MEC) servers. To render the CTs' digital environment, this zone is partitioned and teleported as distributed sub-metaverses to the MEC servers. To guarantee seamless synchronization of the sub-metaverses and their associated CTs with the dynamics of the real world and PTs, respectively, this joint synchronization problem is posed as an optimization problem whose goal is to minimize the average sub-synchronization time between the real and digital worlds, while meeting the DT synchronization intensity requirements. To solve this problem, a novel iterative algorithm for joint sub-metaverse and DT association at the MEC servers is proposed. This algorithm exploits the rigorous framework of optimal transport theory so as to efficiently distribute the sub-metaverses and DTs, while considering the computing and communication resource allocations. Simulation results show that the proposed solution can orchestrate the interplay between DTs and sub-metaverses to achieve a 25.75% reduction in the sub-synchronization time in comparison to the signal-to-noise ratio-based association scheme.

Submitted 26 November, 2022; originally announced November 2022.

arXiv:2211.02295 [pdf]

Experiment of Multi-UAV Full-Duplex System Equipped with Directional Antennas

Authors: Tao Yu, Kento Kajiwara, Kiyomichi Araki, Kei Sakaguchi

Abstract: One of the key enablers for the realization of a variety of unmanned aerial vehicle (UAV)-based systems is the high-performance communication system linking many UAVs and ground station. We have proposed a spectrum-efficient full-duplex directional-antennas-equipped multi-UAV communication system with low hardware complexity to address the issues of low spectrum efficiency caused by co-channel interference in aerial channels. In this paper, by using the prototype system including UAVs and ground station, field experiments are carried out to confirm the feasibility and effectiveness of the proposed system's key feature, i.e., co-channel interference cancellation among UAVs by directional antennas and UAV relative position control, instead of energy-consuming dedicated self-interference cancellers on UAVs in traditional full-duplex systems. Both uplink and downlink performance are tested. Specifically, in the downlink experiment, channel power of interference between a pair of two UAVs is measured when UAVs are in different positional relationships. The experiment results agree well with the designs and confirm that the proposed system can greatly improve the system performance.

Submitted 4 November, 2022; originally announced November 2022.

Comments: The paper was accepted by IEEE Consumer Communications & Networking Conference (CCNC) 2023

arXiv:2207.13332 [pdf, other]

RealTime QA: What's the Answer Right Now?

Authors: Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir Radev, Noah A. Smith, Yejin Choi, Kentaro Inui

Abstract: We introduce REALTIME QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis (weekly in this version). REALTIME QA inquires about the current world, and QA systems need to answer questions about novel events or information. It therefore challenges static, conventional assumptions in open-domain QA datasets and pursues instantaneous applications. We build strong baseline models upon large pretrained language models, including GPT-3 and T5. Our benchmark is an ongoing effort, and this paper presents real-time evaluation results over the past year. Our experimental results show that GPT-3 can often properly update its generation results, based on newly-retrieved documents, highlighting the importance of up-to-date information retrieval. Nonetheless, we find that GPT-3 tends to return outdated answers when retrieved documents do not provide sufficient information to find an answer. This suggests an important avenue for future research: can an open-domain QA system identify such unanswerable cases and communicate with the user or even the retrieval module to modify the retrieval results? We hope that REALTIME QA will spur progress in instantaneous applications of question answering and beyond.

Submitted 28 February, 2024; v1 submitted 27 July, 2022; originally announced July 2022.

Comments: RealTime QA Website: https://realtimeqa.github.io/

arXiv:2205.11484 [pdf, other]

Towards Automated Document Revision: Grammatical Error Correction, Fluency Edits, and Beyond

Authors: Masato Mita, Keisuke Sakaguchi, Masato Hagiwara, Tomoya Mizumoto, Jun Suzuki, Kentaro Inui

Abstract: Natural language processing technology has rapidly improved automated grammatical error correction tasks, and the community begins to explore document-level revision as one of the next challenges. To go beyond sentence-level automated grammatical error correction to NLP-based document-level revision assistant, there are two major obstacles: (1) there are few public corpora with document-level revisions being annotated by professional editors, and (2) it is not feasible to elicit all possible references and evaluate the quality of revision with such references because there are infinite possibilities of revision. This paper tackles these challenges. First, we introduce a new document-revision corpus, TETRA, where professional editors revised academic papers sampled from the ACL anthology which contain few trivial grammatical errors that enable us to focus more on document- and paragraph-level edits such as coherence and consistency. Second, we explore reference-less and interpretable methods for meta-evaluation that can detect quality improvements by document revision. We show the uniqueness of TETRA compared with existing document revision corpora and demonstrate that a fine-tuned pre-trained language model can discriminate the quality of documents after revision even when the difference is subtle. This promising result will encourage the community to further explore automated document revision models and metrics in future.

Submitted 23 May, 2022; originally announced May 2022.

Comments: 14 pages

arXiv:2205.09273 [pdf, other]

Twist Decoding: Diverse Generators Guide Each Other

Authors: Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Hao Peng, Ximing Lu, Dragomir Radev, Yejin Choi, Noah A. Smith

Abstract: Many language generation models are now available for a wide range of generation tasks, including machine translation and summarization. Combining such diverse models may lead to further progress, but ensembling generation models is challenging during inference: conventional ensembling methods (e.g., shallow fusion) require that the models share vocabulary/tokenization schemes. We introduce Twist decoding, a simple and general text generation algorithm that benefits from diverse models at inference time. Our method does not assume the vocabulary, tokenization or even generation order is shared. Our extensive evaluations on machine translation and scientific paper summarization demonstrate that Twist decoding substantially outperforms each model decoded in isolation over various scenarios, including cases where domain-specific and general-purpose models are both available. Twist decoding also consistently outperforms the popular reranking heuristic where output candidates from one model are rescored by another. We hope that our work will encourage researchers and practitioners to examine generation models collectively, not just independently, and to seek out models with complementary strengths to the currently available models. Our code is available at https://github.com/jungokasai/twist_decoding.

Submitted 28 October, 2022; v1 submitted 18 May, 2022; originally announced May 2022.

Comments: Proc. of EMNLP 2022

arXiv:2205.00395 [pdf, other]

ELQA: A Corpus of Metalinguistic Questions and Answers about English

Authors: Shabnam Behzad, Keisuke Sakaguchi, Nathan Schneider, Amir Zeldes

Abstract: We present ELQA, a corpus of questions and answers in and about the English language. Collected from two online forums, the >70k questions (from English learners and others) cover wide-ranging topics including grammar, meaning, fluency, and etymology. The answers include descriptions of general properties of English vocabulary and grammar as well as explanations about specific (correct and incorrect) usage examples. Unlike most NLP datasets, this corpus is metalinguistic -- it consists of language about language. As such, it can facilitate investigations of the metalinguistic capabilities of NLU models, as well as educational applications in the language learning domain. To study this, we define a free-form question answering task on our dataset and conduct evaluations on multiple LLMs (Large Language Models) to analyze their capacity to generate metalinguistic answers.

Submitted 3 July, 2023; v1 submitted 1 May, 2022; originally announced May 2022.

Comments: Accepted to ACL 2023

arXiv:2204.05424 [pdf, other]

A Call for Clarity in Beam Search: How It Works and When It Stops

Authors: Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Dragomir Radev, Yejin Choi, Noah A. Smith

Abstract: Text generation with beam search has proven successful in a wide range of applications. We point out that, though largely overlooked in the literature, the commonly-used implementation of beam decoding (e.g., Hugging Face Transformers and fairseq) uses a first come, first served heuristic: it keeps a set of already completed sequences over time steps and stops when the size of this set reaches the beam size. Based on this finding, we introduce a patience factor, a simple modification to this beam decoding implementation, that generalizes the stopping criterion and provides flexibility to the depth of search. Empirical results demonstrate that adjusting this patience factor improves decoding performance of strong pretrained models on news text summarization and machine translation over diverse language pairs, with a negligible inference slowdown. Our approach only modifies one line of code and can be thus readily incorporated in any implementation. Further, we find that different versions of beam decoding result in large performance differences in summarization, demonstrating the need for clarity in specifying the beam search implementation in research work. Our code will be available upon publication.

Submitted 28 February, 2024; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: LREC-COLING 2024
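
Note: the stopping rule described in the abstract above can be sketched in a few lines. This is a minimal illustration only, not the authors' code: the function name `should_stop` and the multiplicative form `patience * beam_size` are assumptions made here for illustration. The abstract itself says only that the default rule stops once the number of completed sequences reaches the beam size and that the patience factor generalizes this criterion.

```python
def should_stop(num_finished: int, beam_size: int, patience: float = 1.0) -> bool:
    """Hypothetical stopping check for beam decoding.

    With patience == 1.0 this is the "first come, first served" rule from
    the abstract: stop once the set of completed sequences reaches the beam
    size. Scaling the threshold by `patience` is an assumed generalization;
    larger values let the search run deeper before stopping.
    """
    return num_finished >= patience * beam_size


# Default rule: five completed sequences with a beam of five ends the search.
print(should_stop(num_finished=5, beam_size=5))
# With patience 2.0 the same state keeps searching until ten are completed.
print(should_stop(num_finished=5, beam_size=5, patience=2.0))
```

Under this sketch, a patience below 1.0 would stop earlier than the default rule and a patience above 1.0 later, which matches the abstract's description of added flexibility in search depth.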

arXiv:2202.04330 [pdf, other]

Reflexive tactics for algebra, revisited

Authors: Kazuhiko Sakaguchi

Abstract: Computational reflection allows us to turn verified decision procedures into efficient automated reasoning tools in proof assistants. The typical applications of such methodology include mathematical structures that have decidable theory fragments, e.g., equational theories of commutative rings and lattices. However, such existing tools are known not to cooperate with packed classes, a methodology to define mathematical structures in dependent type theory, that allows for the sharing of vocabulary across the inheritance hierarchy. Additionally, such tools do not support homomorphisms whose domain and codomain types may differ. This paper demonstrates how to implement reflexive tactics that support packed classes and homomorphisms. As applications of our methodology, we adapt the ring and field tactics of Coq to the commutative ring and field structures of the Mathematical Components library, and apply the resulting tactics to the formal proof of the irrationality of $ζ(3)$ by Chyzak, Mahboubi, and Sibut-Pinote, to bring more proof automation.

Submitted 9 February, 2022; originally announced February 2022.

Comments: Under review

arXiv:2202.01600 [pdf]

doi 10.1109/VTC2021-Fall52928.2021.9625304

Context-Based MEC Platform for Augmented-Reality Services in 5G Networks

Authors: Yue Wang, Tao Yu, Kei Sakaguchi

Abstract: Augmented reality (AR) has drawn great attention in recent years. However, current AR devices have drawbacks, e.g., weak computation ability and large power consumption. To solve the problem, mobile edge computing (MEC) can be introduced as a key technology to offload data and computation from AR devices to MEC servers via 5th Generation Mobile Communication Technology (5G) networks. To this end, a context-based MEC platform for AR services in 5G networks is proposed in this paper. On the platform, MEC is employed as a data processing center while AR devices are simplified as universal input/output devices, which overcomes their limitations and achieves better user experience. Moreover, the proof-of-concept (PoC) hardware prototype of the platform, and two typical use cases providing AR services of navigation and face recognition respectively are implemented to demonstrate the feasibility and effectiveness of the platform. Finally, the performance of the platform is also numerically evaluated, and the results validate the system design and agree well with the design expectations.

Submitted 3 February, 2022; originally announced February 2022.

Comments: Accepted in VTC 2021 Fall

arXiv:2202.00177 [pdf]

Spectrum Sharing between Directional-Antenna-Equipped UAV System and Terrestrial Systems

Authors: Tao Yu, Kento Kajiwara, Kiyomichi Araki, Kei Sakaguchi

Abstract: Unmanned aerial vehicles (UAVs)-based applications, such as surveillance systems and wireless relays, are attracting increasing attention from academia and industrial fields. The high-performance aerial communication system is one of the key enablers for them. However, due to the low attenuation of radio waves in the air-to-ground channels, the interference between aerial and terrestrial communication systems would significantly deteriorate their communication performance and greatly limit the potential UAV applications. To address the problem, in this paper, the spectrum sharing strategy between a multiple UAV communication system, in which both UAVs and ground station (GS) are equipped with directional antennas, and terrestrial systems is proposed. The GS position is selected and the flyable areas of the UAVs using certain spectrum resources are defined in advance using prior knowledge from spectrum monitoring on terrestrial communication systems to minimize interference and maximize the flyable areas of the UAVs instead of the low-efficient dynamic channel sensing and allocation for interference elimination. The simulations are conducted through a case study of the spectrum sharing between a multi-UAV video transmission system and the terrestrial wireless local area network (WLAN) system in the 5.7GHz band. The simulation results show that thanks to the proposed system the entire area can be enabled for UAV flight.

Submitted 31 January, 2022; originally announced February 2022.

Comments: This paper was accepted by IEEE Annual Computing and Communication Workshop and Conference (CCWC) 2022

arXiv:2202.00176 [pdf]

Full-Duplex Aerial Communication System for Multiple UAVs with Directional Antennas

Authors: Tao Yu, Kiyomichi Araki, Kei Sakaguchi

Abstract: UAV-based wireless systems, such as wireless relay and remote sensing, have attracted great attention from academia and industry. To realize them, a high-performance wireless aerial communication system, which bridges UAVs and ground stations, is one of the key enablers. However, there are still issues hindering its development, such as the severe co-channel interference among UAVs, and the limited payload/battery-life of UAVs. To address the challenges, we propose an aerial communication system which enables system-level full-duplex communication of multiple UAVs with lower hardware complexities than ideal full-duplex communication systems. In the proposed system, each channel is re-assigned to the uplink and downlink of a pair of UAVs, and each UAV employs a pair of separated channels for its uplink and downlink. The co-channel interference between UAVs that reuse the same channels is eliminated by exploiting the advantages of UAVs' maneuverability and the high-gain directional antennas equipped in UAVs and ground stations, so that dedicated cancellers are not necessary in the proposed system. The system design and performance analysis are given, and the simulation results agree well with the designs.

Submitted 31 January, 2022; originally announced February 2022.

Comments: The paper was accepted by IEEE Consumer Communications & Networking Conference (CCNC) 2022

arXiv:2202.00175 [pdf]

Ground Experiment of Full-Duplex Multi-UAV System Enabled by Directional Antennas

Authors: Tao Yu, Kiyomichi Araki, Kei Sakaguchi

Abstract: A high performance multi-UAV communication system, which bridges multiple UAVs and ground station, is one of the key enablers to realize a variety of UAV-based systems. To address the issues such as the low spectrum efficiency caused by the co-channel interference, we have proposed a spectrum-efficient full-duplex multi-UAV communication system with low hardware complexity. In this paper, on-ground experiments are conducted to confirm the feasibility and effectiveness of the key feature of the proposed system, i.e., co-channel interference cancellation among UAVs by directional antennas and UAV position control, instead of energy-consuming dedicated self-interference cancellers on UAVs in traditional full-duplex systems. Channel power of interference link between a pair of two UAVs reusing the same channel is measured, and the achievable channel capacity is also measured by a prototype system implemented by software-defined radio devices. The results of different antennas and different antenna heights are also compared. The experimental results agree well with the designs and confirm the feasibility and effectiveness of the proposed system. This ground experiment is a work in progress to provide preliminary results for the multi-UAV-based experiments in the air in the future.

Submitted 31 January, 2022; originally announced February 2022.

Comments: This paper was accepted by IEEE Annual Computing and Communication Workshop and Conference (CCWC) 2022

arXiv:2112.07867 [pdf, other]

Interscript: A dataset for interactive learning of scripts through error feedback

Authors: Niket Tandon, Aman Madaan, Peter Clark, Keisuke Sakaguchi, Yiming Yang

Abstract: How can an end-user provide feedback if a deployed structured prediction model generates inconsistent output, ignoring the structural complexity of human language? This is an emerging topic with recent progress in synthetic or constrained settings, and the next big leap would require testing and tuning models in real-world settings. We present a new dataset, Interscript, containing user feedback on a deployed model that generates complex everyday tasks. Interscript contains 8,466 data points -- the input is a possibly erroneous script and a user feedback, and the output is a modified script. We posit two use-cases of Interscript that might significantly advance the state-of-the-art in interactive learning. The dataset is available at: https://github.com/allenai/interscript.

Submitted 15 December, 2021; v1 submitted 14 December, 2021; originally announced December 2021.

Comments: AAAI'22-Workshop on Interactive Machine Learning

arXiv:2112.04139 [pdf, other]

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Authors: Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Lavinia Dunagan, Jacob Morrison, Alexander R. Fabbri, Yejin Choi, Noah A. Smith

Abstract: Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to depend on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously tracks progress in language generation models and metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries. A Billboard automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked based on their correlation with human judgments. We release four Billboards for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially the reference-based ones, overrate machine over human generation, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans) in the future. Our project website is available at https://nlp.cs.washington.edu/billboard/.

Submitted 18 May, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

Comments: Proc. of NAACL 2022

arXiv:2111.08940 [pdf, other]

Transparent Human Evaluation for Image Captioning

Authors: Jungo Kasai, Keisuke Sakaguchi, Lavinia Dunagan, Jacob Morrison, Ronan Le Bras, Yejin Choi, Noah A. Smith

Abstract: We establish THumB, a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall) as well as other aspects that measure the text quality (fluency, conciseness, and inclusive language). Our evaluations demonstrate several critical problems of the current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in coverage of salient information (i.e., recall), while most automatic metrics say the opposite. Our rubric-based results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics.

Submitted 18 May, 2022; v1 submitted 17 November, 2021; originally announced November 2021.

Comments: Proc. of NAACL 2022

arXiv:2111.02609 [pdf, ps, other]

doi 10.1103/PhysRevA.105.013312

Hydrodynamic generation of skyrmions in a two-component Bose-Einstein condensate

Authors: Kyoshiro Sakaguchi, Keisuke Jimbo, Hiroki Saito

Abstract: When an obstacle is moved in a superfluid faster than a critical velocity, quantized vortices are generated behind the obstacle. Here we propose a method to create more complicated topological excitations, three-dimensional skyrmions, behind a moving obstacle. We numerically show that, in a two-component Bose-Einstein condensate, component-dependent obstacle potentials can generate skyrmions in the wake, made up of quantized vortex rings in different components that are linked with each other. The lifetime of generated skyrmions can be prolonged by a guiding potential, which enables the formation of a skyrmion train.

Submitted 3 November, 2021; originally announced November 2021.

Comments: 7 pages, 5 figures, 8 movies

arXiv:2110.07574 [pdf, other]

Can Machines Learn Morality? The Delphi Experiment

Authors: Liwei Jiang, Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jenny Liang, Jesse Dodge, Keisuke Sakaguchi, Maxwell Forbes, Jon Borchardt, Saadia Gabriel, Yulia Tsvetkov, Oren Etzioni, Maarten Sap, Regina Rini, Yejin Choi

Abstract: As AI systems become increasingly powerful and pervasive, there are growing concerns about machines' morality or a lack thereof. Yet, teaching morality to machines is a formidable task, as morality remains among the most intensely debated questions in humanity, let alone for AI. Existing AI systems deployed to millions of users, however, are already making decisions loaded with moral implications, which poses a seemingly impossible challenge: teaching machines moral sense, while humanity continues to grapple with it. To explore this challenge, we introduce Delphi, an experimental framework based on deep neural networks trained directly to reason about descriptive ethical judgments, e.g., "helping a friend" is generally good, while "helping a friend spread fake news" is not. Empirical results shed novel insights on the promises and limits of machine ethics; Delphi demonstrates strong generalization capabilities in the face of novel ethical situations, while off-the-shelf neural network models exhibit markedly poor judgment including unjust biases, confirming the need for explicitly teaching machines moral sense. Yet, Delphi is not perfect, exhibiting susceptibility to pervasive biases and inconsistencies. Despite that, we demonstrate positive use cases of imperfect Delphi, including using it as a component model within other imperfect AI systems. Importantly, we interpret the operationalization of Delphi in light of prominent ethical theories, which leads us to important future research questions.

Submitted 12 July, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

Showing 1–50 of 86 results for author: Sakaguchi, K