-
Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Authors:
Alejandro Hernández-Cano,
Alexander Hägele,
Allen Hao Huang,
Angelika Romanou,
Antoni-Joan Solergibert,
Barna Pasztor,
Bettina Messmer,
Dhia Garbaya,
Eduard Frank Ďurech,
Ido Hakimi,
Juan García Giraldo,
Mete Ismayilzada,
Negar Foroutan,
Skander Moalla,
Tiancheng Chen,
Vinko Sabolčec,
Yixuan Xu,
Michael Aerni,
Badr AlKhamissi,
Ines Altemir Marinas,
Mohammad Hossein Amani,
Matin Ansaripour,
Ilia Badanin,
Harold Benoit,
Emanuela Boros
, et al. (76 additional authors not shown)
Abstract:
We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively r…
▽ More
We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
△ Less
Submitted 17 September, 2025;
originally announced September 2025.
-
RL for Reasoning by Adaptively Revealing Rationales
Authors:
Mohammad Hossein Amani,
Aryo Lotfi,
Nicolas Mario Baldwin,
Samy Bengio,
Mehrdad Farajtabar,
Emmanuel Abbe,
Robert West
Abstract:
We propose that reinforcement learning (RL) from partial expert demonstrations is not merely a training heuristic, but a promising framework for solving complex sequence generation tasks. Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. RL, on the other hand, struggles with sparse rewards and a combinatorially large output…
▽ More
We propose that reinforcement learning (RL) from partial expert demonstrations is not merely a training heuristic, but a promising framework for solving complex sequence generation tasks. Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. RL, on the other hand, struggles with sparse rewards and a combinatorially large output space. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. The supervision length is adjusted dynamically for each sample based on the model's past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We investigate this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality, it can succeed in tasks with long sequences of latent dependencies where SFT and RL both fail to generalize. Using a synthetic task with latent parity constraints, we show that our adaptive curriculum over partial answers reliably solves problems that are otherwise intractable. On mathematical reasoning benchmarks (MATH, GSM8k), we find that curriculum learning enables models to solve problems that RL alone cannot, acquiring new reasoning capabilities through incremental exposure to partial solutions.
△ Less
Submitted 22 June, 2025;
originally announced June 2025.
-
Symbolic Autoencoding for Self-Supervised Sequence Learning
Authors:
Mohammad Hossein Amani,
Nicolas Mario Baldwin,
Amin Mansouri,
Martin Josifoski,
Maxime Peyrard,
Robert West
Abstract:
Traditional language models, adept at next-token prediction in text sequences, often struggle with transduction tasks between distinct symbolic systems, particularly when parallel data is scarce. Addressing this issue, we introduce \textit{symbolic autoencoding} ($Σ$AE), a self-supervised framework that harnesses the power of abundant unparallel data alongside limited parallel data. $Σ$AE connects…
▽ More
Traditional language models, adept at next-token prediction in text sequences, often struggle with transduction tasks between distinct symbolic systems, particularly when parallel data is scarce. Addressing this issue, we introduce \textit{symbolic autoencoding} ($Σ$AE), a self-supervised framework that harnesses the power of abundant unparallel data alongside limited parallel data. $Σ$AE connects two generative models via a discrete bottleneck layer and is optimized end-to-end by minimizing reconstruction loss (simultaneously with supervised loss for the parallel data), such that the sequence generated by the discrete bottleneck can be read out as the transduced input sequence. We also develop gradient-based methods allowing for efficient self-supervised sequence learning despite the discreteness of the bottleneck. Our results demonstrate that $Σ$AE significantly enhances performance on transduction tasks, even with minimal parallel data, offering a promising solution for weakly supervised learning scenarios.
△ Less
Submitted 16 February, 2024;
originally announced February 2024.
-
Memorization and Optimization in Deep Neural Networks with Minimum Over-parameterization
Authors:
Simone Bombari,
Mohammad Hossein Amani,
Marco Mondelli
Abstract:
The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks. A line of work has studied the NTK spectrum for two-layer and deep networks with at least a layer with $Ω(N)$ neurons, $N$ being the number of training samples. Furthermore, there is increasing evidence suggesting that deep networks with sub-li…
▽ More
The Neural Tangent Kernel (NTK) has emerged as a powerful tool to provide memorization, optimization and generalization guarantees in deep neural networks. A line of work has studied the NTK spectrum for two-layer and deep networks with at least a layer with $Ω(N)$ neurons, $N$ being the number of training samples. Furthermore, there is increasing evidence suggesting that deep networks with sub-linear layer widths are powerful memorizers and optimizers, as long as the number of parameters exceeds the number of samples. Thus, a natural open question is whether the NTK is well conditioned in such a challenging sub-linear setup. In this paper, we answer this question in the affirmative. Our key technical contribution is a lower bound on the smallest NTK eigenvalue for deep networks with the minimum possible over-parameterization: the number of parameters is roughly $Ω(N)$ and, hence, the number of neurons is as little as $Ω(\sqrt{N})$. To showcase the applicability of our NTK bounds, we provide two results concerning memorization capacity and optimization guarantees for gradient descent training.
△ Less
Submitted 21 May, 2023; v1 submitted 20 May, 2022;
originally announced May 2022.
-
Sharp asymptotics on the compression of two-layer neural networks
Authors:
Mohammad Hossein Amani,
Simone Bombari,
Marco Mondelli,
Rattana Pukdee,
Stefano Rini
Abstract:
In this paper, we study the compression of a target two-layer neural network with N nodes into a compressed network with M<N nodes. More precisely, we consider the setting in which the weights of the target network are i.i.d. sub-Gaussian, and we minimize the population L_2 loss between the outputs of the target and of the compressed network, under the assumption of Gaussian inputs. By using tools…
▽ More
In this paper, we study the compression of a target two-layer neural network with N nodes into a compressed network with M<N nodes. More precisely, we consider the setting in which the weights of the target network are i.i.d. sub-Gaussian, and we minimize the population L_2 loss between the outputs of the target and of the compressed network, under the assumption of Gaussian inputs. By using tools from high-dimensional probability, we show that this non-convex problem can be simplified when the target network is sufficiently over-parameterized, and provide the error rate of this approximation as a function of the input dimension and N. In this mean-field limit, the simplified objective, as well as the optimal weights of the compressed network, does not depend on the realization of the target network, but only on expected scaling factors. Furthermore, for networks with ReLU activation, we conjecture that the optimum of the simplified optimization problem is achieved by taking weights on the Equiangular Tight Frame (ETF), while the scaling of the weights and the orientation of the ETF depend on the parameters of the target network. Numerical evidence is provided to support this conjecture.
△ Less
Submitted 16 August, 2022; v1 submitted 17 May, 2022;
originally announced May 2022.