Showing 1–14 of 14 results for author: Kuratov, Y

  1. arXiv:2506.05229  [pdf, ps, other]

    cs.LG cs.CL

    Diagonal Batching Unlocks Parallelism in Recurrent Memory Transformers for Long Contexts

    Authors: Danil Sivtsov, Ivan Rodkin, Gleb Kuzmin, Yuri Kuratov, Ivan Oseledets

    Abstract: Transformer models struggle with long-context inference due to their quadratic time and linear memory complexity. Recurrent Memory Transformers (RMTs) offer a solution by reducing the asymptotic cost to linear time and constant memory usage. However, their memory update mechanism leads to sequential execution, causing a performance bottleneck. We introduce Diagonal Batching, a scheduling scheme…

    Submitted 5 June, 2025; originally announced June 2025.

  2. arXiv:2502.13063  [pdf, ps, other]

    cs.CL cs.LG

    Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

    Authors: Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev

    Abstract: A range of recent works addresses the problem of compression of sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or key-value cache. These approaches are focused on reduction of the amount of compute in existing language models rather than minimization of number of bits needed to store text. Despite relying on powerful models as enc…

    Submitted 22 June, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

    Comments: ACL 2025 (main conference)

  3. arXiv:2501.13200  [pdf, other]

    cs.LG cs.AI cs.MA

    SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

    Authors: Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev

    Abstract: Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents' behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers…

    Submitted 22 January, 2025; originally announced January 2025.

    Comments: 16 pages, 11 figures

    ACM Class: I.2.11

  4. arXiv:2408.02439  [pdf, other]

    cs.CL cs.AI

    Long Input Benchmark for Russian Analysis

    Authors: Igor Churin, Murat Apishev, Maria Tikhonova, Denis Shevelev, Aydar Bulatov, Yuri Kuratov, Sergej Averkiev, Alena Fenogenova

    Abstract: Recent advancements in Natural Language Processing (NLP) have fostered the development of Large Language Models (LLMs) that can solve an immense variety of tasks. One of the key aspects of their application is their ability to work with long text documents and to process long sequences of tokens. This has created a demand for proper evaluation of long-context understanding. To address this need fo…

    Submitted 5 August, 2024; originally announced August 2024.

  5. arXiv:2407.04841  [pdf, other]

    cs.CL cs.AI cs.LG

    Associative Recurrent Memory Transformer

    Authors: Ivan Rodkin, Yuri Kuratov, Aydar Bulatov, Mikhail Burtsev

    Abstract: This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is based on transformer self-attention for local context and segment-level recurrence for storage of task specific information distributed over a long context. We dem…

    Submitted 13 February, 2025; v1 submitted 5 July, 2024; originally announced July 2024.

    Comments: ICML 2024 Next Generation of Sequence Modeling Architectures Workshop

    ACM Class: I.2.7

  6. arXiv:2406.10149  [pdf, other]

    cs.CL cs.AI

    BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

    Authors: Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

    Abstract: In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long doc…

    Submitted 6 November, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: NeurIPS 2024 Datasets and Benchmarks Track

  7. arXiv:2402.10790  [pdf, other]

    cs.CL cs.AI cs.LG

    In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

    Authors: Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

    Abstract: This paper addresses the challenge of processing long documents using generative transformer models. To evaluate different approaches, we introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. Our evaluation, which includes benchmarks for GPT-4 and RAG, reveals that common methods are effective only for seque…

    Submitted 20 February, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: 11M tokens, fix qa3 min facts per task in Table 1

  8. arXiv:2311.01326  [pdf, other]

    cs.CL cs.AI

    Better Together: Enhancing Generative Knowledge Graph Completion with Language Models and Neighborhood Information

    Authors: Alla Chepurova, Aydar Bulatov, Yuri Kuratov, Mikhail Burtsev

    Abstract: Real-world Knowledge Graphs (KGs) often suffer from incompleteness, which limits their potential performance. Knowledge Graph Completion (KGC) techniques aim to address this issue. However, traditional KGC methods are computationally intensive and impractical for large-scale KGs, necessitating the learning of dense node embeddings and computing pairwise distances. Generative transformer-based lang…

    Submitted 2 November, 2023; originally announced November 2023.

    Comments: Accepted to Findings of the Association for Computational Linguistics: EMNLP 2023

  9. arXiv:2304.11062  [pdf, other]

    cs.CL cs.AI cs.LG

    Scaling Transformer to 1M tokens and beyond with RMT

    Authors: Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail S. Burtsev

    Abstract: A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer models to extend input context length while linearly scaling compute. Our approach demonstrates the capability to store information in memory for sequences of up…

    Submitted 6 February, 2024; v1 submitted 19 April, 2023; originally announced April 2023.

  10. arXiv:2207.06881  [pdf, other]

    cs.CL cs.LG

    Recurrent Memory Transformer

    Authors: Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

    Abstract: Transformer-based models show their effectiveness across multiple domains and tasks. The self-attention allows to combine information from all sequence elements into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by quadratic computational complexity of self-…

    Submitted 8 December, 2022; v1 submitted 14 July, 2022; originally announced July 2022.

    Comments: 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

  11. arXiv:2205.02340  [pdf, other]

    cs.CL cs.LG

    Knowledge Distillation of Russian Language Models with Reduction of Vocabulary

    Authors: Alina Kolesnikova, Yuri Kuratov, Vasily Konovalov, Mikhail Burtsev

    Abstract: Today, transformer language models serve as a core component for majority of natural language processing tasks. Industrial application of such models requires minimization of computation time and memory footprint. Knowledge distillation is one of approaches to address this goal. Existing methods in this field are mainly focused on reducing the number of layers or dimension of embeddings/hidden rep…

    Submitted 4 May, 2022; originally announced May 2022.

  12. arXiv:2006.11527  [pdf, other]

    cs.CL cs.LG cs.NE

    Memory Transformer

    Authors: Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov

    Abstract: Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations. This might limit the processing of properties related…

    Submitted 16 February, 2021; v1 submitted 20 June, 2020; originally announced June 2020.

  13. arXiv:2002.02450  [pdf, other]

    cs.CL cs.LG stat.ML

    Goal-Oriented Multi-Task BERT-Based Dialogue State Tracker

    Authors: Pavel Gulyaev, Eugenia Elistratova, Vasily Konovalov, Yuri Kuratov, Leonid Pugachev, Mikhail Burtsev

    Abstract: Dialogue State Tracking (DST) is a core component of virtual assistants such as Alexa or Siri. To accomplish various tasks, these assistants need to support an increasing number of services and APIs. The Schema-Guided State Tracking track of the 8th Dialogue System Technology Challenge highlighted the DST problem for unseen services. The organizers introduced the Schema-Guided Dialogue (SGD) datas…

    Submitted 5 February, 2020; originally announced February 2020.

  14. arXiv:1905.07213  [pdf, other]

    cs.CL

    Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language

    Authors: Yuri Kuratov, Mikhail Arkhipov

    Abstract: The paper introduces methods of adaptation of multilingual masked language models for a specific language. Pre-trained bidirectional language models show state-of-the-art performance on a wide range of tasks including reading comprehension, natural language inference, and sentiment analysis. At the moment there are two alternative approaches to train such models: monolingual and multilingual. Whil…

    Submitted 17 May, 2019; originally announced May 2019.