-
TraceSim: A Method for Calculating Stack Trace Similarity
Authors:
Roman Vasiliev,
Dmitrij Koznov,
George Chernishev,
Aleksandr Khvorov,
Dmitry Luciv,
Nikita Povarov
Abstract:
Many contemporary software products have subsystems for automatic crash reporting. However, it is well-known that the same bug can produce slightly different reports. To manage this problem, reports are usually grouped, often manually by developers. Manual triaging, however, becomes infeasible for products that have large userbases, which is the reason for many different approaches to automating t…
▽ More
Many contemporary software products have subsystems for automatic crash reporting. However, it is well-known that the same bug can produce slightly different reports. To manage this problem, reports are usually grouped, often manually by developers. Manual triaging, however, becomes infeasible for products that have large userbases, which is the reason for many different approaches to automating this task. Moreover, it is important to improve quality of triaging due to the big volume of reports that needs to be processed properly. Therefore, even a relatively small improvement could play a significant role in overall accuracy of report bucketing. The majority of existing studies use some kind of a stack trace similarity metric, either based on information retrieval techniques or string matching methods. However, it should be stressed that the quality of triaging is still insufficient. In this paper, we describe TraceSim -- a novel approach to address this problem which combines TF-IDF, Levenshtein distance, and machine learning to construct a similarity metric. Our metric has been implemented inside an industrial-grade report triaging system. The evaluation on a manually labeled dataset shows significantly better results compared to baseline approaches.
△ Less
Submitted 26 September, 2020;
originally announced September 2020.
-
Interactive Duplicate Search in Software Documentation
Authors:
D. V. Luciv,
D. V. Koznov,
A. A. Shelikhovskii,
K. Yu. Romanovsky,
G. A. Chernishev,
A. N. Terekhov,
D. A. Grigoriev,
A. N. Smirnova,
D. V. Borovkov,
A. I. Vasenina
Abstract:
Various software features such as classes, methods, requirements, and tests often have similar functionality. This can lead to emergence of duplicates in their descriptive documentation. Uncontrolled duplicates created via copy/paste hinder the process of documentation maintenance. Therefore, the task of duplicate detection in software documentation is of importance. Solving it makes planned reuse…
▽ More
Various software features such as classes, methods, requirements, and tests often have similar functionality. This can lead to emergence of duplicates in their descriptive documentation. Uncontrolled duplicates created via copy/paste hinder the process of documentation maintenance. Therefore, the task of duplicate detection in software documentation is of importance. Solving it makes planned reuse possible, as well as creating and using templates for unification and automatic generation of documentation. In this paper, we present an interactive process for duplicate detection that involves the user in order to conduct meaningful search. It includes a new formal definition of a near duplicate, a pattern-based, and the proof of its completeness. Moreover, we demonstrate the results of experimenting on a collection of documents of several industrial projects.
△ Less
Submitted 22 August, 2019;
originally announced August 2019.
-
Detecting Near Duplicates in Software Documentation
Authors:
D. V. Luciv,
D. V. Koznov,
G. A. Chernishev,
A. N. Terekhov
Abstract:
Contemporary software documentation is as complicated as the software itself. During its lifecycle, the documentation accumulates a lot of near duplicate fragments, i.e. chunks of text that were copied from a single source and were later modified in different ways. Such near duplicates decrease documentation quality and thus hamper its further utilization. At the same time, they are hard to detect…
▽ More
Contemporary software documentation is as complicated as the software itself. During its lifecycle, the documentation accumulates a lot of near duplicate fragments, i.e. chunks of text that were copied from a single source and were later modified in different ways. Such near duplicates decrease documentation quality and thus hamper its further utilization. At the same time, they are hard to detect manually due to their fuzzy nature. In this paper we give a formal definition of near duplicates and present an algorithm for their detection in software documents. This algorithm is based on the exact software clone detection approach: the software clone detection tool Clone Miner was adapted to detect exact duplicates in documents. Then, our algorithm uses these exact duplicates to construct near ones. We evaluate the proposed algorithm using the documentation of 19 open source and commercial projects. Our evaluation is very comprehensive - it covers various documentation types: design and requirement specifications, programming guides and API documentation, user manuals. Overall, the evaluation shows that all kinds of software documentation contain a significant number of both exact and near duplicates. Next, we report on the performed manual analysis of the detected near duplicates for the Linux Kernel Documentation. We present both quantative and qualitative results of this analysis, demonstrate algorithm strengths and weaknesses, and discuss the benefits of duplicate management in software documents.
△ Less
Submitted 13 November, 2017;
originally announced November 2017.