Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings

Tsvetkov, Petr; Eliseeva, Aleksandra; Dig, Danny; Bezzubov, Alexander; Golubev, Yaroslav; Bryksin, Timofey; Zharov, Yaroslav

arXiv:2410.12046 (cs)

[Submitted on 15 Oct 2024 (v1), last revised 8 Jan 2025 (this version, v2)]

Abstract: When a Commit Message Generation (CMG) system is integrated into the IDEs and other products at JetBrains, we perform online evaluation based on user acceptance of the generated messages. However, performing online experiments with every change to a CMG system is troublesome, as each iteration affects users and requires time to collect enough statistics. On the other hand, offline evaluation, a prevalent approach in the research literature, facilitates fast experiments but employs automatic metrics that are not guaranteed to represent the preferences of real users. In this work, we describe a novel way we employed to deal with this problem at JetBrains, by leveraging an online metric (the number of edits users introduce before committing the generated messages to the VCS) to select metrics for offline experiments.
To support this new type of evaluation, we develop a novel markup collection tool mimicking the real workflow with a CMG system, collect a dataset with 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts, and design and verify a way to synthetically extend such a dataset. Then, we use the final dataset of 656 pairs to study how the widely used similarity metrics correlate with the online metric reflecting the real users' experience.
Our results indicate that edit distance exhibits the highest correlation with the online metric, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. This contradicts the previous studies on similarity metrics for CMG, suggesting that user interactions with a CMG system in real-world settings differ significantly from the responses by human labelers within controlled environments. We release all the code and the dataset to support future research in the field: this https URL.
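
As an illustration of the metric the results single out, a minimal sketch of a character-level Levenshtein edit distance between a generated message and its edited counterpart might look like the following. The function name, the character-level granularity, and the example strings are assumptions made for illustration; the released code may compute the distance differently.

```python
def edit_distance(generated: str, edited: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    # previous[j] holds the distance between the already processed prefix
    # of `generated` and the first j characters of `edited`.
    previous = list(range(len(edited) + 1))
    for i, g_char in enumerate(generated, start=1):
        current = [i]  # turning i characters into an empty string costs i deletions
        for j, e_char in enumerate(edited, start=1):
            substitution = previous[j - 1] + (g_char != e_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Hypothetical pair, not taken from the paper's dataset.
generated_message = "Fix bug in parser"
edited_message = "Fix null check in parser"
print(edit_distance(generated_message, edited_message))
```

A lower value means the user had to change less of the generated message before committing it, which is the intuition behind treating the distance as a proxy for the online edit count.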

Comments:	10 pages, 5 figures (Published at ICSE'2025)
Subjects:	Software Engineering (cs.SE); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Cite as:	arXiv:2410.12046 [cs.SE]
	(or arXiv:2410.12046v2 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2410.12046

Submission history

From: Yaroslav Zharov
[v1] Tue, 15 Oct 2024 20:32:07 UTC (1,660 KB)
[v2] Wed, 8 Jan 2025 15:35:02 UTC (1,661 KB)
