-
Long Code Arena: a Set of Benchmarks for Long-Context Code Models
Authors:
Egor Bogomolov,
Aleksandra Eliseeva,
Timur Galimzyanov,
Evgeniy Glukhov,
Anton Shapkin,
Maria Tigina,
Yaroslav Golubev,
Alexander Kovrigin,
Arie van Deursen,
Maliheh Izadi,
Timofey Bryksin
Abstract:
The fields of code and natural language processing are evolving rapidly. In particular, models are becoming better at processing long context windows: supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, while the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and a link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.
Submitted 17 June, 2024;
originally announced June 2024.
-
Kotlin ML Pack: Technical Report
Authors:
Sergey Titov,
Mikhail Evtikhiev,
Anton Shapkin,
Oleg Smirnov,
Sergei Boytsov,
Dariia Karaeva,
Maksim Sheptyakov,
Mikhail Arkhipov,
Timofey Bryksin,
Egor Bogomolov
Abstract:
In this technical report, we present three novel datasets of Kotlin code: KStack, KStack-clean, and KExercises. We also describe the results of fine-tuning CodeLlama and DeepSeek models on this data. Additionally, we present a version of the HumanEval benchmark rewritten by human experts into Kotlin - both the solutions and the tests. Our results demonstrate that small, high-quality datasets (KStack-clean and KExercises) can significantly improve model performance on code generation tasks, achieving up to a 16-point increase in pass rate on the HumanEval benchmark. Lastly, we discuss potential future work in the field of improving language modeling for Kotlin, including the use of static analysis tools in the learning process and the introduction of more intricate and realistic benchmarks.
Submitted 29 May, 2024;
originally announced May 2024.
-
Dynamic Retrieval-Augmented Generation
Authors:
Anton Shapkin,
Denis Litvinov,
Yaroslav Zharov,
Egor Bogomolov,
Timur Galimzyanov,
Timofey Bryksin
Abstract:
Current state-of-the-art large language models are effective in generating high-quality text and encapsulating a broad spectrum of world knowledge. These models, however, often hallucinate and lack locally relevant factual data. Retrieval-augmented approaches were introduced to overcome these problems and provide more accurate responses. Typically, the retrieved information is simply appended to the main request, restricting the context window size of the model. We propose a novel approach, Dynamic Retrieval-Augmented Generation (DRAG), based on entity-augmented generation, which injects compressed embeddings of the retrieved entities into the generative model. The proposed pipeline was developed for code-generation tasks, yet can be transferred to some domains of natural language processing. To train the model, we collect and publish a new project-level code generation dataset. We use it for the evaluation along with publicly available datasets. Our approach achieves several targets: (1) lifting the length limitations of the context window, saving on the prompt size; (2) allowing huge expansion of the number of retrieval entities available for the context; (3) alleviating the problem of misspelling or failing to find relevant entity names. This allows the model to beat all baselines (except GPT-3.5) by a wide margin.
Submitted 20 February, 2024; v1 submitted 14 December, 2023;
originally announced December 2023.
-
Recording and Reproduction of Pattern Memory Trace in EEG by Direct Electrical Stimulation of Brain Cortex
Authors:
A. G. Shapkin,
M. V. Taborov,
Yu. G. Shapkin
Abstract:
This study demonstrates the capability of external signal recording into memory and the reproduction of memory trace of this pattern in EEG by direct AC electrical stimulation of rat cerebral cortex. Additionally, we examine shifts of the DC potential level related to these phenomena. We show that in the course of memory trace reproduction, consecutive phases of engram activation and relaxation are registered and accompanied by corresponding negative and positive DC shifts. The observed electrophysiological changes may reflect consecutive activation and inhibition phases of neural ensembles participating in engram formation.
Submitted 22 November, 2011; v1 submitted 22 November, 2010;
originally announced November 2010.