SimulBench: Evaluating Language Models with Creative Simulation Tasks

Jia, Qi; Yue, Xiang; Zheng, Tianyu; Huang, Jie; Lin, Bill Yuchen

Computer Science > Computation and Language

arXiv:2409.07641 (cs)

[Submitted on 11 Sep 2024]

Title:SimulBench: Evaluating Language Models with Creative Simulation Tasks

Authors:Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin

View PDF HTML (experimental)

Abstract:We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM's general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest using a fixed LLM as a user agent to engage with an LLM to collect dialogues first under different tasks. Then, challenging dialogue scripts are extracted for evaluating different target LLMs. To facilitate automatic assessment on \DataName{}, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by the target LLMs given multi-turn dialogue scripts. Our comprehensive experiments indicate that these simulation tasks continue to pose a significant challenge with their unique natures and show the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55\% more cases.

Comments:	Website: this https URL
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2409.07641 [cs.CL]
	(or arXiv:2409.07641v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2409.07641

Submission history

From: Bill Yuchen Lin [view email]
[v1] Wed, 11 Sep 2024 21:53:20 UTC (1,072 KB)

Computer Science > Computation and Language

Title:SimulBench: Evaluating Language Models with Creative Simulation Tasks

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:SimulBench: Evaluating Language Models with Creative Simulation Tasks

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators