Calibration and Correctness of Language Models for Code

Spiess, Claudio; Gros, David; Pai, Kunal Suresh; Pradel, Michael; Rabin, Md Rafiqul Islam; Alipour, Amin; Jha, Susmit; Devanbu, Prem; Ahmed, Toufique


arXiv:2402.02047 (cs)

[Submitted on 3 Feb 2024 (v1), last revised 21 Aug 2024 (this version, v4)]


Abstract: Machine learning models are widely used but are often wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so that they can make a rational decision about whether to use it. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with the likelihood of correctness, then the model is said to be well-calibrated.
A well-calibrated confidence measure can serve as a basis for rational, graduated decision-making on how much review and care is needed when using generated code. Calibration has so far been studied mostly in non-generative (e.g., classification) settings, especially in software engineering. However, generated code can quite often be wrong: developers must decide whether to use model-generated code directly, use it after review of varying intensity, or discard it. Thus, calibration is vital in generative settings.
We make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that, by and large, the generative code models we test are not well-calibrated out of the box. We then show how calibration can be improved using standard methods, such as Platt scaling. Since Platt scaling relies on the prior availability of correctness data, we evaluate the applicability and generalizability of Platt scaling in software engineering, and discuss settings where it has good potential for practical use and settings where it does not. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offer a framework for future research to further improve calibration methods for generative models in software engineering.
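To make the two ideas in the abstract concrete, here is a minimal sketch, not the authors' implementation. It measures miscalibration with expected calibration error (a common metric chosen here as an assumption; the abstract does not name one) and rescales confidences with Platt scaling, fitted by plain gradient descent. All function names and the toy data are hypothetical.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, correct, lr=1.0, steps=20000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, correct):
            err = _sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(scores, a, b):
    """Map raw confidence scores to rescaled probabilities."""
    return [_sigmoid(a * s + b) for s in scores]


def demo():
    # Hypothetical, overconfident toy data: confidence 0.9 but 60% correct,
    # and confidence 0.6 but 20% correct.
    scores = [0.9] * 10 + [0.6] * 10
    correct = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 8
    before = expected_calibration_error(scores, correct)
    a, b = fit_platt(scores, correct)
    after = expected_calibration_error(apply_platt(scores, a, b), correct)
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print(f"ECE before rescaling: {before:.3f}, after: {after:.3f}")
```

The sketch fits and evaluates on the same toy data only to stay short. As the abstract notes, Platt scaling depends on previously collected correctness labels, so a realistic evaluation would fit the two parameters on one set of labeled outputs and measure calibration on another.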

Comments:	Published in ICSE'25
Subjects:	Software Engineering (cs.SE); Machine Learning (cs.LG)
Cite as:	arXiv:2402.02047 [cs.SE]
	(or arXiv:2402.02047v4 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2402.02047

Submission history

From: Claudio Spiess
[v1] Sat, 3 Feb 2024 05:52:28 UTC (491 KB)
[v2] Fri, 9 Feb 2024 22:18:05 UTC (491 KB)
[v3] Fri, 16 Feb 2024 22:58:07 UTC (467 KB)
[v4] Wed, 21 Aug 2024 01:58:38 UTC (584 KB)
