Search | arXiv e-print repository

Memento No More: Coaching AI Agents to Master Multiple Tasks via Hints Internalization

Authors: Minttu Alakuijala, Ya Gao, Georgy Ananov, Samuel Kaski, Pekka Marttinen, Alexander Ilin, Harri Valpola

Abstract: As the general capabilities of artificial intelligence (AI) agents continue to evolve, their ability to learn to master multiple complex tasks through experience remains a key challenge. Current LLM agents, particularly those based on proprietary language models, typically rely on prompts to incorporate knowledge about the target tasks. This approach does not allow the agent to internalize this in… ▽ More As the general capabilities of artificial intelligence (AI) agents continue to evolve, their ability to learn to master multiple complex tasks through experience remains a key challenge. Current LLM agents, particularly those based on proprietary language models, typically rely on prompts to incorporate knowledge about the target tasks. This approach does not allow the agent to internalize this information and instead relies on ever-expanding prompts to sustain its functionality in diverse scenarios. This resembles a system of notes used by a person affected by anterograde amnesia, the inability to form new memories. In this paper, we propose a novel method to train AI agents to incorporate knowledge and skills for multiple tasks without the need for either cumbersome note systems or prior high-quality demonstration data. Our approach employs an iterative process where the agent collects new experiences, receives corrective feedback from humans in the form of hints, and integrates this feedback into its weights via a context distillation training procedure. We demonstrate the efficacy of our approach by implementing it in a Llama-3-based agent that, after only a few rounds of feedback, outperforms advanced models GPT-4o and DeepSeek-V3 in tasksets requiring correct sequencing of information retrieval, tool use, and question answering. △ Less

Submitted 28 May, 2025; v1 submitted 3 February, 2025; originally announced February 2025.

arXiv:2412.14964 [pdf, other]

Knowledge Injection via Prompt Distillation

Authors: Kalle Kujanpää, Harri Valpola, Alexander Ilin

Abstract: In many practical applications, large language models (LLMs) need to incorporate new knowledge not present in their pre-training data. The primary methods for this are fine-tuning and retrieval-augmented generation (RAG). Although RAG has emerged as the industry standard for knowledge injection, fine-tuning has not yet achieved comparable success. In this paper, we propose a new fine-tuning techni… ▽ More In many practical applications, large language models (LLMs) need to incorporate new knowledge not present in their pre-training data. The primary methods for this are fine-tuning and retrieval-augmented generation (RAG). Although RAG has emerged as the industry standard for knowledge injection, fine-tuning has not yet achieved comparable success. In this paper, we propose a new fine-tuning technique for learning new knowledge and show that it can reach the performance of RAG. The proposed method is based on the self-distillation approach, which we call prompt distillation. First, we generate question-answer pairs about the new knowledge. Then, we fine-tune a student model on the question-answer pairs to imitate the output distributions of a teacher model, which additionally receives the new knowledge in its prompt. The student model is identical to the teacher, except it is equipped with a LoRA adapter. This training procedure facilitates distilling the new knowledge from the teacher's prompt into the student's weights. △ Less

Submitted 19 December, 2024; originally announced December 2024.

Comments: Preprint

arXiv:1903.11981 [pdf, other]

Regularizing Trajectory Optimization with Denoising Autoencoders

Authors: Rinu Boney, Norman Di Palo, Mathias Berglund, Alexander Ilin, Juho Kannala, Antti Rasmus, Harri Valpola

Abstract: Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed reg… ▽ More Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed regularization leads to improved planning with both gradient-based and gradient-free optimizers. We also demonstrate that using regularized trajectory optimization leads to rapid initial learning in a set of popular motor control tasks, which suggests that the proposed approach can be a useful tool for improving sample efficiency. △ Less

Submitted 25 December, 2019; v1 submitted 28 March, 2019; originally announced March 2019.

Comments: NeurIPS 2019

arXiv:1812.03955 [pdf, other]

Improving Model-Based Control and Active Exploration with Reconstruction Uncertainty Optimization

Authors: Norman Di Palo, Harri Valpola

Abstract: Model based predictions of future trajectories of a dynamical system often suffer from inaccuracies, forcing model based control algorithms to re-plan often, thus being computationally expensive, suboptimal and not reliable. In this work, we propose a model agnostic method for estimating the uncertainty of a model?s predictions based on reconstruction error, using it in control and exploration. As… ▽ More Model based predictions of future trajectories of a dynamical system often suffer from inaccuracies, forcing model based control algorithms to re-plan often, thus being computationally expensive, suboptimal and not reliable. In this work, we propose a model agnostic method for estimating the uncertainty of a model?s predictions based on reconstruction error, using it in control and exploration. As our experiments show, this uncertainty estimation can be used to improve control performance on a wide variety of environments by choosing predictions of which the model is confident. It can also be used for active learning to explore more efficiently the environment by planning for trajectories with high uncertainty, allowing faster model learning. △ Less

Submitted 10 December, 2018; originally announced December 2018.

arXiv:1709.02797 [pdf, ps, other]

On the exact relationship between the denoising function and the data distribution

Authors: Heikki Arponen, Matti Herranen, Harri Valpola

Abstract: We prove an exact relationship between the optimal denoising function and the data distribution in the case of additive Gaussian noise, showing that denoising implicitly models the structure of data allowing it to be exploited in the unsupervised learning of representations. This result generalizes a known relationship [2], which is valid only in the limit of small corruption noise. We prove an exact relationship between the optimal denoising function and the data distribution in the case of additive Gaussian noise, showing that denoising implicitly models the structure of data allowing it to be exploited in the unsupervised learning of representations. This result generalizes a known relationship [2], which is valid only in the limit of small corruption noise. △ Less

Submitted 6 September, 2017; originally announced September 2017.

Comments: 4 pages

arXiv:1707.09219 [pdf, other]

Recurrent Ladder Networks

Authors: Isabeau Prémont-Schwarz, Alexander Ilin, Tele Hotloo Hao, Antti Rasmus, Rinu Boney, Harri Valpola

Abstract: We propose a recurrent extension of the Ladder networks whose structure is motivated by the inference required in hierarchical latent variable models. We demonstrate that the recurrent Ladder is able to handle a wide variety of complex learning tasks that benefit from iterative inference and temporal modeling. The architecture shows close-to-optimal results on temporal modeling of video data, comp… ▽ More We propose a recurrent extension of the Ladder networks whose structure is motivated by the inference required in hierarchical latent variable models. We demonstrate that the recurrent Ladder is able to handle a wide variety of complex learning tasks that benefit from iterative inference and temporal modeling. The architecture shows close-to-optimal results on temporal modeling of video data, competitive results on music modeling, and improved perceptual grouping based on higher order abstractions, such as stochastic textures and motion cues. We present results for fully supervised, semi-supervised, and unsupervised tasks. The results suggest that the proposed architecture and principles are powerful tools for learning a hierarchy of abstractions, learning iterative inference and handling temporal information. △ Less

Submitted 18 December, 2017; v1 submitted 28 July, 2017; originally announced July 2017.

Comments: 9 pages, 9 figures, 7-page appendix, fixed fig 9 (c)

arXiv:1703.01780 [pdf, other]

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

Authors: Antti Tarvainen, Harri Valpola

Abstract: The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example, and penalizes predictions that are inconsistent with this target. However, because the targets change only once per epoch, Temporal Ensembling becomes unwieldy when learning large da… ▽ More The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example, and penalizes predictions that are inconsistent with this target. However, because the targets change only once per epoch, Temporal Ensembling becomes unwieldy when learning large datasets. To overcome this problem, we propose Mean Teacher, a method that averages model weights instead of label predictions. As an additional benefit, Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling. Without changing the network architecture, Mean Teacher achieves an error rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling trained with 1000 labels. We also show that a good network architecture is crucial to performance. Combining Mean Teacher and Residual Networks, we improve the state of the art on CIFAR-10 with 4000 labels from 10.55% to 6.28%, and on ImageNet 2012 with 10% of the labels from 35.24% to 9.11%. △ Less

Submitted 16 April, 2018; v1 submitted 6 March, 2017; originally announced March 2017.

Comments: In this version: Corrected hyperparameters of the 4000-label CIFAR-10 ResNet experiment. Changed Antti's contact info, Advances in Neural Information Processing Systems 30 (NIPS 2017) pre-proceedings

arXiv:1606.06724 [pdf, other]

Tagger: Deep Unsupervised Perceptual Grouping

Authors: Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hotloo Hao, Jürgen Schmidhuber, Harri Valpola

Abstract: We present a framework for efficient perceptual inference that explicitly reasons about the segmentation of its inputs and features. Rather than being trained for any specific segmentation, our framework learns the grouping process in an unsupervised manner or alongside any supervised task. By enriching the representations of a neural network, we enable it to group the representations of different… ▽ More We present a framework for efficient perceptual inference that explicitly reasons about the segmentation of its inputs and features. Rather than being trained for any specific segmentation, our framework learns the grouping process in an unsupervised manner or alongside any supervised task. By enriching the representations of a neural network, we enable it to group the representations of different objects in an iterative manner. By allowing the system to amortize the iterative inference of the groupings, we achieve very fast convergence. In contrast to many other recently proposed methods for addressing multi-object scenes, our system does not assume the inputs to be images and can therefore directly handle other modalities. For multi-digit classification of very cluttered images that require texture segmentation, our method offers improved classification performance over convolutional networks despite being fully connected. Furthermore, we observe that our system greatly improves on the semi-supervised result of a baseline Ladder network on our dataset, indicating that segmentation can also improve sample efficiency. △ Less

Submitted 28 November, 2016; v1 submitted 21 June, 2016; originally announced June 2016.

Comments: 14 pages + 5 pages supplementary, accepted at NIPS 2016

MSC Class: 97R40

arXiv:1507.02672 [pdf, other]

Semi-Supervised Learning with Ladder Networks

Authors: Antti Rasmus, Harri Valpola, Mikko Honkala, Mathias Berglund, Tapani Raiko

Abstract: We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the… ▽ More We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2015), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. △ Less

Submitted 24 November, 2015; v1 submitted 9 July, 2015; originally announced July 2015.

Comments: Revised denoising function, updated results, fixed typos

arXiv:1504.08215 [pdf, other]

Lateral Connections in Denoising Autoencoders Support Supervised Learning

Authors: Antti Rasmus, Harri Valpola, Tapani Raiko

Abstract: We show how a deep denoising autoencoder with lateral connections can be used as an auxiliary unsupervised learning task to support supervised learning. The proposed model is trained to minimize simultaneously the sum of supervised and unsupervised cost functions by back-propagation, avoiding the need for layer-wise pretraining. It improves the state of the art significantly in the permutation-inv… ▽ More We show how a deep denoising autoencoder with lateral connections can be used as an auxiliary unsupervised learning task to support supervised learning. The proposed model is trained to minimize simultaneously the sum of supervised and unsupervised cost functions by back-propagation, avoiding the need for layer-wise pretraining. It improves the state of the art significantly in the permutation-invariant MNIST classification task. △ Less

Submitted 30 April, 2015; originally announced April 2015.

arXiv:1412.7210 [pdf, other]

Denoising autoencoder with modulated lateral connections learns invariant representations of natural images

Authors: Antti Rasmus, Tapani Raiko, Harri Valpola

Abstract: Suitable lateral connections between encoder and decoder are shown to allow higher layers of a denoising autoencoder (dAE) to focus on invariant representations. In regular autoencoders, detailed information needs to be carried through the highest layers but lateral connections from encoder to decoder relieve this pressure. It is shown that abstract invariant features can be translated to detailed… ▽ More Suitable lateral connections between encoder and decoder are shown to allow higher layers of a denoising autoencoder (dAE) to focus on invariant representations. In regular autoencoders, detailed information needs to be carried through the highest layers but lateral connections from encoder to decoder relieve this pressure. It is shown that abstract invariant features can be translated to detailed reconstructions when invariant features are allowed to modulate the strength of the lateral connection. Three dAE structures with modulated and additive lateral connections, and without lateral connections were compared in experiments using real-world images. The experiments verify that adding modulated lateral connections to the model 1) improves the accuracy of the probability model for inputs, as measured by denoising performance; 2) results in representations whose degree of invariance grows faster towards the higher layers; and 3) supports the formation of diverse invariant poolings. △ Less

Submitted 31 March, 2015; v1 submitted 22 December, 2014; originally announced December 2014.

Comments: Presentation at ICLR 2015 workshop

arXiv:1411.7783 [pdf, ps, other]

From neural PCA to deep unsupervised learning

Authors: Harri Valpola

Abstract: A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochast… ▽ More A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments. △ Less

Submitted 2 February, 2015; v1 submitted 28 November, 2014; originally announced November 2014.

Comments: A revised version of an article that has been accepted for publication in Advances in Independent Component Analysis and Learning Machines (2015), edited by Ella Bingham, Samuel Kaski, Jorma Laaksonen and Jouko Lampinen

arXiv:1301.3476 [pdf, other]

Pushing Stochastic Gradient towards Second-Order Methods -- Backpropagation Learning with Transformations in Nonlinearities

Authors: Tommi Vatanen, Tapani Raiko, Harri Valpola, Yann LeCun

Abstract: Recently, we proposed to transform the outputs of each hidden neuron in a multi-layer perceptron network to have zero output and zero slope on average, and use separate shortcut connections to model the linear dependencies instead. We continue the work by firstly introducing a third transformation to normalize the scale of the outputs of each hidden neuron, and secondly by analyzing the connection… ▽ More Recently, we proposed to transform the outputs of each hidden neuron in a multi-layer perceptron network to have zero output and zero slope on average, and use separate shortcut connections to model the linear dependencies instead. We continue the work by firstly introducing a third transformation to normalize the scale of the outputs of each hidden neuron, and secondly by analyzing the connections to second order optimization methods. We show that the transformations make a simple stochastic gradient behave closer to second-order optimization methods and thus speed up learning. This is shown both in theory and with experiments. The experiments on the third transformation show that while it further increases the speed of learning, it can also hurt performance by converging to a worse local optimum, where both the inputs and outputs of many hidden neurons are close to zero. △ Less

Submitted 11 March, 2013; v1 submitted 15 January, 2013; originally announced January 2013.

Comments: 10 pages, 5 figures, ICLR2013

arXiv:1207.1380 [pdf]

Bayes Blocks: An Implementation of the Variational Bayesian Building Blocks Framework

Authors: Markus Harva, Tapani Raiko, Antti Honkela, Harri Valpola, Juha Karhunen

Abstract: A software library for constructing and learning probabilistic models is presented. The library offers a set of building blocks from which a large variety of static and dynamic models can be built. These include hierarchical models for variances of other variables and many nonlinear models. The underlying variational Bayesian machinery, providing for fast and robust estimation but being mathematic… ▽ More A software library for constructing and learning probabilistic models is presented. The library offers a set of building blocks from which a large variety of static and dynamic models can be built. These include hierarchical models for variances of other variables and many nonlinear models. The underlying variational Bayesian machinery, providing for fast and robust estimation but being mathematically rather involved, is almost completely hidden from the user thus making it very easy to use the library. The building blocks include Gaussian, rectified Gaussian and mixture-of-Gaussians variables and computational nodes which can be combined rather freely. △ Less

Submitted 4 July, 2012; originally announced July 2012.

Comments: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005)

Report number: UAI-P-2005-PG-259-266

Showing 1–14 of 14 results for author: Valpola, H