Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning

Bao, Qiming; Gendron, Gael; Peng, Alex Yuxuan; Zhong, Wanjun; Tan, Neset; Chen, Yang; Witbrock, Michael; Liu, Jiamou

arXiv:2310.09430 (cs)

[Submitted on 13 Oct 2023 (v1), last revised 17 Jan 2025 (this version, v5)]

Abstract: Large language models (LLMs), such as LLaMA, Alpaca, Vicuna, GPT-3.5 and GPT-4, have advanced the performance of AI systems on various natural language processing tasks to human-like levels. However, their generalisation and robustness when performing logical reasoning have not been sufficiently assessed. To comprehensively evaluate this ability, we develop three new logical reasoning datasets named "ReClor-plus", "LogiQA-plus" and "LogiQAv2-plus" that extend standard logical reasoning datasets to evaluate the robustness of LLMs' reasoning. For each, we create three subsets: the first with randomly shuffled options, the second with the correct choices replaced by "none of the other options is correct", and the third with a combination of shuffling and substitution. Experiments on these datasets show that these simple augmentations greatly hinder the models' performance. Despite their high performance on the original publicly available datasets, we find that all models perform poorly on these newly constructed datasets. We also demonstrate that introducing task variations into the training set can markedly improve the models' performance on both the original and our developed datasets. Finally, we show that applying logic-driven data augmentation for fine-tuning and prompting can enhance generalisation in both discriminative and generative models, offering a path to improving their robustness for tasks involving logical reasoning. Source code and data are made publicly available at this https URL.
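
The three task structure variations described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' released code: the `Item` record, the function names, and the choice to shuffle before substituting in the combined variant are assumptions made for illustration. Only the replacement string is taken from the abstract.

```python
import random
from dataclasses import dataclass, replace
from typing import Tuple

# Replacement text quoted in the abstract.
NONE_OPTION = "none of the other options is correct"


@dataclass(frozen=True)
class Item:
    """Hypothetical multiple-choice record; field names are assumptions."""
    context: str
    question: str
    options: Tuple[str, ...]
    label: int  # index of the correct option


def shuffle_options(item: Item, rng: random.Random) -> Item:
    """Variation 1: randomly reorder the options and remap the label."""
    order = list(range(len(item.options)))
    rng.shuffle(order)
    options = tuple(item.options[i] for i in order)
    # The correct option now sits wherever its old index landed in `order`.
    return replace(item, options=options, label=order.index(item.label))


def replace_answer(item: Item) -> Item:
    """Variation 2: replace the correct option's text with NONE_OPTION."""
    options = tuple(
        NONE_OPTION if i == item.label else text
        for i, text in enumerate(item.options)
    )
    # The label is unchanged: the substituted option is still the answer.
    return replace(item, options=options)


def shuffle_and_replace(item: Item, rng: random.Random) -> Item:
    """Variation 3: shuffle first, then substitute (order is an assumption)."""
    return replace_answer(shuffle_options(item, rng))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    item = Item("Some passage.", "Which follows?", ("A", "B", "C", "D"), 2)
    for variant in (shuffle_options(item, rng),
                    replace_answer(item),
                    shuffle_and_replace(item, rng)):
        print(variant.options, "->", variant.options[variant.label])
```

In all three variants the label keeps pointing at the correct option, so a model that has learned the reasoning rather than option positions or surface wording should, in principle, be unaffected. Applying the same functions to training items would be one way to introduce task variations into a training set, again as an assumption about how such augmentation could be done.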

Comments:	The short version (v3) was accepted for oral presentation at the first LLM@IJCAI 2023 non-archival symposium, and the full version was accepted by ICONIP 2024
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2310.09430 [cs.CL]
	(or arXiv:2310.09430v5 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.09430

Submission history

From: Qiming Bao
[v1] Fri, 13 Oct 2023 22:29:15 UTC (7,827 KB)
[v2] Tue, 17 Oct 2023 02:08:24 UTC (7,827 KB)
[v3] Wed, 18 Oct 2023 22:46:12 UTC (7,827 KB)
[v4] Sat, 30 Mar 2024 09:49:19 UTC (89 KB)
[v5] Fri, 17 Jan 2025 04:39:38 UTC (70 KB)
