SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs

Kokane, Shirley; Zhu, Ming; Awalgaonkar, Tulika; Zhang, Jianguo; Hoang, Thai; Prabhakar, Akshara; Liu, Zuxin; Lan, Tian; Yang, Liangwei; Tan, Juntao; Murthy, Rithesh; Yao, Weiran; Liu, Zhiwei; Niebles, Juan Carlos; Wang, Huan; Heinecke, Shelby; Xiong, Caiming; Savarese, Silvio

arXiv:2411.13547 (cs)

[Submitted on 20 Nov 2024]

Abstract: Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically give only a success rate without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark dataset comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.

Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2411.13547 [cs.SE]
	(or arXiv:2411.13547v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2411.13547

Submission history

From: Shirley Kokane
[v1] Wed, 20 Nov 2024 18:56:22 UTC (932 KB)
