Precise Model Benchmarking with Only a Few Observations

Fogliato, Riccardo; Patil, Pratik; Akpinar, Nil-Jana; Monfort, Mathew

Computer Science > Machine Learning

arXiv:2410.05222 (cs)

[Submitted on 7 Oct 2024]

Title: Precise Model Benchmarking with Only a Few Observations

Authors: Riccardo Fogliato, Pratik Patil, Nil-Jana Akpinar, Mathew Monfort


Abstract: How can we precisely estimate a large language model's (LLM) accuracy on questions belonging to a specific topic within a larger question-answering dataset? The standard direct estimator, which averages the model's accuracy on the questions in each subgroup, may exhibit high variance for subgroups (topics) with small sample sizes. Synthetic regression modeling, which leverages the model's accuracy on questions about other topics, may yield biased estimates that are too unreliable for large subgroups. We prescribe a simple yet effective solution: an empirical Bayes (EB) estimator that balances direct and regression estimates for each subgroup separately, improving the precision of subgroup-level estimates of model performance. Our experiments on multiple datasets show that this approach consistently provides more precise estimates of LLM performance than the direct and regression approaches, achieving substantial reductions in mean squared error. Confidence intervals for EB estimates also have near-nominal coverage and are narrower than those for the direct estimator. Additional experiments on tabular and vision data validate the benefits of this EB approach.
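To make the idea of balancing the direct and regression estimates concrete, here is a minimal sketch of a generic shrinkage estimator. It is not the paper's implementation: the function name `eb_shrinkage` is hypothetical, the pooled accuracy stands in for the regression estimate, and the between-subgroup variance comes from a simple method-of-moments formula chosen for illustration.

```python
import numpy as np

def eb_shrinkage(correct, groups):
    """Shrink per-subgroup accuracies toward a pooled estimate (illustrative only)."""
    correct = np.asarray(correct, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)

    # Direct estimator: mean accuracy within each subgroup.
    n = np.array([(groups == g).sum() for g in labels], dtype=float)
    direct = np.array([correct[groups == g].mean() for g in labels])

    # Assumption: the pooled accuracy stands in for the regression estimate.
    pooled = correct.mean()
    synthetic = np.full(len(labels), pooled)

    # Approximate sampling variance of each direct estimate (pooled Bernoulli variance).
    samp_var = pooled * (1.0 - pooled) / n

    # Method-of-moments estimate of the between-subgroup variance, truncated at zero.
    between_var = max(0.0, float(np.mean((direct - synthetic) ** 2 - samp_var)))

    # Weight on the direct estimate: near 1 for large subgroups, smaller for tiny ones.
    denom = between_var + samp_var
    weight = np.divide(between_var, denom, out=np.ones_like(denom), where=denom > 0)

    eb = weight * direct + (1.0 - weight) * synthetic
    keys = labels.tolist()
    return dict(zip(keys, eb.tolist())), dict(zip(keys, weight.tolist()))

# Usage: a 2-question topic is pulled strongly toward the pooled accuracy,
# while a 100-question topic stays close to its direct estimate.
correct = [1, 1] + [1] * 50 + [0] * 50
groups = ["a"] * 2 + ["b"] * 100
estimates, weights = eb_shrinkage(correct, groups)
```

In this sketch the weight on the direct estimate grows with subgroup size, which is the qualitative behavior the abstract describes; a real regression component would replace the pooled mean with predictions from subgroup-level features.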

Comments:	To appear at EMNLP 2024
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Cite as:	arXiv:2410.05222 [cs.LG]
	(or arXiv:2410.05222v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2410.05222

Submission history

From: Riccardo Fogliato
[v1] Mon, 7 Oct 2024 17:26:31 UTC (137 KB)
