SCC: Automatic Classification of Code Snippets

Alreshedy, Kamel; Dharmaretnam, Dhanush; German, Daniel M.; Srinivasan, Venkatesh; Gulliver, T. Aaron

arXiv:1809.07945 (cs)

[Submitted on 21 Sep 2018]

Abstract: Identifying the programming language of a complete source code file is a well-studied problem: Machine Learning (ML) and Natural Language Processing (NLP) algorithms have been shown to be effective at it. However, determining the programming language of a code snippet, or of just a few lines of source code, remains a challenging task. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we describe Source Code Classification (SCC), a classifier that can identify the programming language of code snippets written in 21 different programming languages. SCC employs a Multinomial Naive Bayes (MNB) classifier trained on Stack Overflow posts. It achieves an accuracy of 75%, which is higher than that of Programming Languages Identification (PLI), a proprietary online classifier of snippets, whose accuracy is only 55.5%. The average precision, recall and F1 score of the proposed tool are 0.76, 0.75 and 0.75, respectively. In addition, SCC can distinguish between code snippets from a family of programming languages such as C, C++ and C#, and can also identify the programming language version, such as C# 3.0, C# 4.0 and C# 5.0.
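
To make the idea of a Multinomial Naive Bayes snippet classifier concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: the tokenizer, the smoothing constant `alpha`, the class name `SnippetNB` and the six-snippet training set are all assumptions chosen for illustration, whereas the abstract describes training on Stack Overflow posts covering 21 languages.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenizer: identifiers, integer literals, and single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]")


def tokenize(snippet):
    return TOKEN_RE.findall(snippet)


class SnippetNB:
    """Multinomial Naive Bayes over token counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, snippets, labels):
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.doc_counts = Counter(labels)         # label -> number of snippets
        self.vocab = set()
        for snippet, label in zip(snippets, labels):
            tokens = tokenize(snippet)
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def log_scores(self, snippet):
        # Tokens never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(snippet) if t in self.vocab]
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / self.total_docs)  # log prior
            for t in tokens:
                score += math.log(
                    (counts[t] + self.alpha) / (total + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, snippet):
        scores = self.log_scores(snippet)
        return max(scores, key=scores.get)


# Hypothetical toy training data, far smaller than any realistic corpus.
TRAIN = [
    ("def add(a, b):\n    return a + b", "python"),
    ("import os\nprint(os.getcwd())", "python"),
    ('public static void main(String[] args) { System.out.println("hi"); }', "java"),
    ("private int count = 0; public int getCount() { return count; }", "java"),
    ('#include <stdio.h>\nint main(void) { printf("hi\\n"); return 0; }', "c"),
    ("char *p = malloc(10); free(p);", "c"),
]

model = SnippetNB().fit([s for s, _ in TRAIN], [l for _, l in TRAIN])
print(model.predict("def square(x):\n    return x * x"))
```

The classifier sums log probabilities instead of multiplying raw probabilities, which avoids numerical underflow on longer snippets. With so little training data the predictions are only indicative; closely related languages such as C, C++ and C# share most of their tokens, which is part of what makes the task described in the abstract difficult.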

Subjects:	Software Engineering (cs.SE); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1809.07945 [cs.SE]
	(or arXiv:1809.07945v1 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.1809.07945
Journal reference:	Working Conference on Source Code Analysis & Manipulation 2018

Submission history

From: Kamel Alrashedy
[v1] Fri, 21 Sep 2018 04:50:40 UTC (3,946 KB)
