Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

Patel, Nisarg; Kulkarni, Mohith; Parmar, Mihir; Budhiraja, Aashna; Nakamura, Mutsumi; Varshney, Neeraj; Baral, Chitta

arXiv:2406.17169 (cs)

[Submitted on 24 Jun 2024 (v1), last revised 7 Oct 2024 (this version, v3)]

Abstract: As Large Language Models (LLMs) continue to exhibit remarkable performance in natural language understanding tasks, there is a crucial need to measure their ability for human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of datasets for evaluating non-monotonic reasoning represents a crucial gap since it aligns more closely with human-like reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths. Multi-LogiEval covers three logic types--propositional, first-order, and non-monotonic--consisting of more than 30 inference rules and more than 60 of their combinations with various depths. Leveraging this dataset, we conduct evaluations on a range of LLMs including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral, employing a zero-shot chain-of-thought. Experimental results show that there is a significant drop in the performance of LLMs as the reasoning steps/depth increases (average accuracy of ~68% at depth-1 to ~43% at depth-5). We further conduct a thorough investigation of reasoning chains generated by LLMs which reveals several important findings. We believe that Multi-LogiEval facilitates future research for evaluating and enhancing the logical reasoning ability of LLMs. Data is available at this https URL.

Comments:	Accepted at EMNLP 2024 Main
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2406.17169 [cs.CL]
	(or arXiv:2406.17169v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.17169

Submission history

From: Mihir Parmar
[v1] Mon, 24 Jun 2024 23:02:56 UTC (278 KB)
[v2] Fri, 4 Oct 2024 05:00:13 UTC (282 KB)
[v3] Mon, 7 Oct 2024 03:48:18 UTC (282 KB)
