-
Scaling Transformer to 1M tokens and beyond with RMT
Authors:
Aydar Bulatov,
Yuri Kuratov,
Yermek Kapushev,
Mikhail S. Burtsev
Abstract:
A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer models to extend input context length while linearly scaling compute. Our approach demonstrates the capability to store information in memory for sequences of up to an unprecedented two million tokens while maintaining high retrieval accuracy. Experiments with language modeling tasks show perplexity improvement as the number of processed input segments increases. These results underscore the effectiveness of our method, which has significant potential to enhance long-term dependency handling in natural language understanding and generation tasks, as well as enable large-scale context processing for memory-intensive applications.
Submitted 6 February, 2024; v1 submitted 19 April, 2023;
originally announced April 2023.
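The quadratic-versus-linear compute claim in the abstract above can be illustrated with a rough count of attention score computations. This is a minimal sketch under stated assumptions, not the authors' cost model: the function names are hypothetical, each segment is assumed to attend only over its own tokens plus a fixed number of memory tokens, and the last segment is counted as if padded to full length.

```python
import math


def full_attention_pairs(n_tokens: int) -> int:
    """Token-pair scores for full self-attention over the whole input."""
    return n_tokens * n_tokens


def segmented_attention_pairs(n_tokens: int, segment_len: int, n_memory: int) -> int:
    """Token-pair scores when the input is processed segment by segment.

    Assumption: every segment attends over (segment_len + n_memory) positions,
    and a short final segment is counted as a full one.
    """
    n_segments = math.ceil(n_tokens / segment_len)
    window = segment_len + n_memory
    return n_segments * window * window


if __name__ == "__main__":
    for n in (1_024, 2_048, 4_096):
        print(n, full_attention_pairs(n), segmented_attention_pairs(n, 512, 10))
```

Doubling the input length quadruples the full-attention count but only doubles the segmented count, because the per-segment window stays fixed.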
-
Recurrent Memory Transformer
Authors:
Aydar Bulatov,
Yuri Kuratov,
Mikhail S. Burtsev
Abstract:
Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention allows information from all sequence elements to be combined into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention.
In this work, we propose and study a memory-augmented segment-level recurrent Transformer (RMT). Memory allows the model to store and process local and global information, as well as to pass information between segments of a long sequence with the help of recurrence.
We implement a memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence. The model is then trained to control both memory operations and the processing of sequence representations.
Results of experiments show that RMT performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks that require longer sequence processing. We also show that adding memory tokens to Transformer-XL improves its performance. This makes the Recurrent Memory Transformer a promising architecture for applications that require learning long-term dependencies and general-purpose in-memory processing, such as algorithmic tasks and reasoning.
Submitted 8 December, 2022; v1 submitted 14 July, 2022;
originally announced July 2022.
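The segment-level recurrence described in the Recurrent Memory Transformer abstract above can be sketched as a loop that carries a small memory from one segment to the next. This is only an illustrative toy under stated assumptions: `run_recurrent_memory`, `split_into_segments`, and `toy_step` are hypothetical names, tokens are plain integers, and `toy_step` is a stand-in for a Transformer forward pass over the memory tokens and one segment, not the authors' model.

```python
from typing import Callable, List, Sequence, Tuple

Memory = List[int]
Step = Callable[[Memory, List[int]], Tuple[Memory, List[int]]]


def split_into_segments(tokens: Sequence[int], segment_len: int) -> List[List[int]]:
    """Cut a long sequence into consecutive segments of at most segment_len tokens."""
    return [list(tokens[i:i + segment_len]) for i in range(0, len(tokens), segment_len)]


def run_recurrent_memory(
    tokens: Sequence[int], segment_len: int, memory: Memory, step: Step
) -> Tuple[Memory, List[int]]:
    """Process segments in order; the memory is the only state passed between them."""
    outputs: List[int] = []
    for segment in split_into_segments(tokens, segment_len):
        memory, segment_out = step(memory, segment)
        outputs.extend(segment_out)
    return memory, outputs


def toy_step(memory: Memory, segment: List[int]) -> Tuple[Memory, List[int]]:
    """Hypothetical stand-in for the model: read the old memory, write a new one."""
    segment_out = [token + memory[0] for token in segment]  # outputs depend on memory
    new_memory = [memory[0] + sum(segment)]                 # memory summarizes the past
    return new_memory, segment_out


if __name__ == "__main__":
    final_memory, outputs = run_recurrent_memory([1, 2, 3, 4, 5], 2, [0], toy_step)
    print(final_memory)  # [15]
    print(outputs)       # [1, 2, 6, 7, 15]
```

In the toy run, the outputs for the second and third segments differ from their inputs only because of what earlier segments wrote into memory, which is the role the abstract assigns to the memory tokens.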
-
Memory Transformer
Authors:
Mikhail S. Burtsev,
Yuri Kuratov,
Anton Peganov,
Grigory V. Sapunov
Abstract:
Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows the Transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations. This might make the processing of properties related to the sequence as a whole more difficult. Adding trainable memory to selectively store local as well as global representations of a sequence is a promising direction for improving the Transformer model. Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations. MANNs have demonstrated the capability to learn simple algorithms like Copy or Reverse and can be successfully trained via backpropagation on diverse tasks, from question answering to language modeling, outperforming RNNs and LSTMs of comparable complexity. In this work, we propose and study several extensions of the Transformer baseline: (1) adding memory tokens to store non-local representations, (2) creating a memory bottleneck for the global information, and (3) controlling memory updates with a dedicated layer. We evaluate these memory-augmented Transformers and demonstrate that the presence of memory positively correlates with model performance on machine translation and language modelling tasks. Augmentation of a pre-trained masked language model with memory tokens shows mixed results for tasks from the GLUE benchmark. Visualization of attention patterns over the memory suggests that it improves the model's ability to process a global context.
Submitted 16 February, 2021; v1 submitted 20 June, 2020;
originally announced June 2020.
-
Continual and Multi-task Reinforcement Learning With Shared Episodic Memory
Authors:
Artyom Y. Sorokin,
Mikhail S. Burtsev
Abstract:
Episodic memory plays an important role in the behavior of animals and humans. It allows the accumulation of information about the current state of the environment in a task-agnostic way. This episodic representation can later be accessed by downstream tasks to make their execution more efficient. In this work, we introduce a neural architecture with shared episodic memory (SEM) for learning and the sequential execution of multiple tasks. We explicitly split the encoding of episodic memory and task-specific memory into separate recurrent sub-networks. An agent augmented with SEM was able to effectively reuse episodic knowledge collected during other tasks to improve its policy on a current task in the Taxi problem. Repeated use of the episodic representation in continual learning experiments facilitated the acquisition of novel skills in the same environment.
Submitted 7 May, 2019;
originally announced May 2019.
-
Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition
Authors:
L. T. Anh,
M. Y. Arkhipov,
M. S. Burtsev
Abstract:
Named Entity Recognition (NER) is one of the most common tasks in natural language processing. The purpose of NER is to find and classify tokens in text documents into predefined categories called tags, such as person names, quantity expressions, percentage expressions, names of locations and organizations, as well as expressions of time, currency, and others. Although a number of approaches have been proposed for this task in the Russian language, there is still substantial room for better solutions. In this work, we studied several deep neural network models, starting from a vanilla Bi-directional Long Short-Term Memory (Bi-LSTM) network, then supplementing it with Conditional Random Fields (CRF) as well as highway networks, and finally adding external word embeddings. All models were evaluated across three datasets: Gareev's dataset, Person-1000, and FactRuEval-2016. We found that extending the Bi-LSTM model with CRF significantly increased the quality of predictions. Encoding input tokens with external word embeddings reduced training time and made it possible to achieve state-of-the-art results on the Russian NER task.
Submitted 8 October, 2017; v1 submitted 27 September, 2017;
originally announced September 2017.
-
Alife Model of Evolutionary Emergence of Purposeful Adaptive Behavior
Authors:
Mikhail S. Burtsev,
Vladimir G. Redko,
Roman V. Gusarev
Abstract:
The process of evolutionary emergence of purposeful adaptive behavior is investigated by means of computer simulations. The proposed model assumes an evolving population of simple agents that have two natural needs: energy and reproduction. Each need is characterized quantitatively by a corresponding motivation. Motivations determine the goal-directed behavior of agents. The model demonstrates that purposeful behavior does emerge in the simulated evolutionary processes. The emergence of purposefulness is accompanied by the origin of a simple hierarchy in the control system of agents.
Submitted 8 October, 2001;
originally announced October 2001.