-
So Much in So Little: Creating Lightweight Embeddings of Python Libraries
Authors:
Yaroslav Golubev,
Egor Bogomolov,
Egor Bulychev,
Timofey Bryksin
Abstract:
In software engineering, different approaches and machine learning models leverage different types of data: source code, textual information, historical data. An important part of any project is its dependencies. The list of dependencies is relatively small but carries a lot of semantics with it, which can be used to compare projects or make judgements about them.
In this paper, we focus on Python projects and their PyPI dependencies in the form of requirements.txt files. We compile a dataset of 7,132 Python projects and their dependencies, and use Git to pull their versions from previous years. Using this data, we build 32-dimensional embeddings of libraries by applying Singular Value Decomposition to the co-occurrence matrix of projects and libraries. We then cluster the embeddings and study their semantic relations.
To showcase the usefulness of such lightweight library embeddings, we introduce a prototype tool for suggesting relevant libraries to a given project. The tool computes project embeddings and uses dependencies of projects with similar embeddings to form suggestions. To compare different library recommenders, we have created a benchmark based on the evolution of dependency sets in open-source projects. Approaches based on the created embeddings significantly outperform the baseline of showing the most popular libraries in a given year. We have also conducted a user study that showed that the suggestions differ in quality for different project domains and that even relevant suggestions might not be particularly useful. Finally, to facilitate potentially more useful recommendations, we extended the recommender system with an option to suggest rarer libraries.
Submitted 7 September, 2022;
originally announced September 2022.
-
Identifying collaborators in large codebases
Authors:
Waren Long,
Vadim Markovtsev,
Hugo Mougard,
Egor Bulychev,
Jan Hula
Abstract:
The way developers collaborate inside and particularly across teams often escapes management's attention, despite a formal organization with designated teams being defined. Observability of the actual, organically formed engineering structure provides decision makers with invaluable additional tools to manage their talent pool. To identify existing inter- and intra-team interactions - and suggest relevant opportunities for suitable collaborations - this paper studies contributors' commit activity, usage of programming languages, and code identifier topics by embedding and clustering them. We evaluate our findings in collaboration with the GitLab organization, analyzing 117 of their open source projects. We show that we are able to restore their engineering organization in broad strokes, and also reveal hidden coding collaborations as well as justify in-house technical decisions.
Submitted 7 May, 2019;
originally announced May 2019.
-
STYLE-ANALYZER: fixing code style inconsistencies with interpretable unsupervised algorithms
Authors:
Vadim Markovtsev,
Waren Long,
Hugo Mougard,
Konstantin Slavnov,
Egor Bulychev
Abstract:
Source code reviews are manual, time-consuming, and expensive. Human involvement should be focused on analyzing the most relevant aspects of the program, such as logic and maintainability, rather than amending style, syntax, or formatting defects. Some tools with linting capabilities can format code automatically and report various stylistic violations for supported programming languages. They are based on rules written by domain experts; hence, their configuration is often tedious, and it is impractical for a given set of rules to cover all possible corner cases. Some machine learning-based solutions exist, but they remain uninterpretable black boxes. This paper introduces STYLE-ANALYZER, a new open source tool to automatically fix code formatting violations using a decision tree forest model which adapts to each codebase and is fully unsupervised. STYLE-ANALYZER is built on top of our novel assisted code review framework, Lookout. It accurately mines the formatting style of each analyzed Git repository and expresses the found format patterns with compact human-readable rules. STYLE-ANALYZER can then suggest style inconsistency fixes in the form of code review comments. We evaluate the output quality and practical relevance of STYLE-ANALYZER by demonstrating that it can reproduce the original style with high precision, measured on 19 popular JavaScript projects, and by showing that it yields promising results in fixing real style mistakes. STYLE-ANALYZER includes a web application to visualize how the rules are triggered. We release STYLE-ANALYZER as a reusable and extendable open source software package on GitHub for the benefit of the community.
Submitted 1 April, 2019;
originally announced April 2019.
-
Splitting source code identifiers using Bidirectional LSTM Recurrent Neural Network
Authors:
Vadim Markovtsev,
Waren Long,
Egor Bulychev,
Romain Keramitas,
Konstantin Slavnov,
Gabor Markowski
Abstract:
Programmers make rich use of natural language in the source code they write through identifiers and comments. Source code identifiers are selected from a pool of tokens which are strongly related to the meaning, naming conventions, and context. These tokens are often combined to produce more precise and obvious designations. Such multi-part identifiers account for 97% of all naming tokens in the Public Git Archive - the largest dataset of Git repositories to date. We introduce a bidirectional LSTM recurrent neural network to detect subtokens in source code identifiers. We trained that network on 41.7 million distinct splittable identifiers collected from 182,014 open source projects in Public Git Archive, and show that it outperforms several other machine learning models. The proposed network can be used to improve the upstream models which are based on source code identifiers, as well as to improve developer experience by allowing code to be written without switching the keyboard case.
Submitted 19 July, 2018; v1 submitted 26 May, 2018;
originally announced May 2018.