Benchmarking and Rethinking Knowledge Editing for Large Language Models

He, Guoxiu; Song, Xin; Wang, Futing; Sun, Aixin

Abstract: Knowledge editing aims to update the knowledge embedded within Large Language Models (LLMs). However, existing approaches, whether based on parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs under a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also evaluate multi-edit scenarios to better reflect practical demands. We employ four evaluation dimensions, including portability, and compare all recent methods against a simple baseline named Selective Contextual Reasoning (SCR). Empirical results reveal that parameter-based editing methods perform poorly under realistic conditions, whereas SCR consistently outperforms them across all settings. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.

Comments: arXiv admin note: text overlap with arXiv:2503.05212
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2505.18690 [cs.CL] (or arXiv:2505.18690v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2505.18690
