Search | arXiv e-print repository

Entropy-Driven Pre-Tokenization for Byte-Pair Encoding

Authors: Yifan Hu, Frank Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig W. Schmidt, Chris Tanner

Abstract: Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such as Chinese presents significant challenges, as its frequency-driven merge operation is agnostic to linguistic boundaries. To address this, we propose two entropy… ▽ More Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such as Chinese presents significant challenges, as its frequency-driven merge operation is agnostic to linguistic boundaries. To address this, we propose two entropy-informed pre-tokenization strategies that guide BPE segmentation using unsupervised information-theoretic cues. The first approach uses pointwise mutual information and left/right entropy to identify coherent character spans, while the second leverages predictive entropy derived from a pretrained GPT-2 model to detect boundary uncertainty. We evaluate both methods on a subset of the PKU dataset and demonstrate substantial improvements in segmentation precision, recall, and F1 score compared to standard BPE. Our results suggest that entropy-guided pre-tokenization not only enhances alignment with gold-standard linguistic units but also offers a promising direction for improving tokenization quality in low-resource and multilingual settings. △ Less

Submitted 18 June, 2025; originally announced June 2025.

arXiv:2506.04118 [pdf, ps, other]

Guided Speculative Inference for Efficient Test-Time Alignment of LLMs

Authors: Jonathan Geuter, Youssef Mroueh, David Alvarez-Melis

Abstract: We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $π_S(y\mid x)$. We provably approximate the optimal tilted policy $π_{β,B}(y\mid x) \propto π_B(y\mid x)\exp(β\,r(x,y))$ of soft best-of-$n$ u… ▽ More We propose Guided Speculative Inference (GSI), a novel algorithm for efficient reward-guided decoding in large language models. GSI combines soft best-of-$n$ test-time scaling with a reward model $r(x,y)$ and speculative samples from a small auxiliary model $π_S(y\mid x)$. We provably approximate the optimal tilted policy $π_{β,B}(y\mid x) \propto π_B(y\mid x)\exp(β\,r(x,y))$ of soft best-of-$n$ under the primary model $π_B$. We derive a theoretical bound on the KL divergence between our induced distribution and the optimal policy. In experiments on reasoning benchmarks (MATH500, OlympiadBench, Minerva Math), our method achieves higher accuracy than standard soft best-of-$n$ with $π_S$ and reward-guided speculative decoding (Liao et al., 2025), and in certain settings even outperforms soft best-of-$n$ with $π_B$. The code is available at https://github.com/j-geuter/GSI . △ Less

Submitted 4 June, 2025; originally announced June 2025.

Comments: 12 pages, 2 figures

ACM Class: I.2.7

arXiv:2503.01140 [pdf, other]

DDEQs: Distributional Deep Equilibrium Models through Wasserstein Gradient Flows

Authors: Jonathan Geuter, Clément Bonet, Anna Korba, David Alvarez-Melis

Abstract: Deep Equilibrium Models (DEQs) are a class of implicit neural networks that solve for a fixed point of a neural network in their forward pass. Traditionally, DEQs take sequences as inputs, but have since been applied to a variety of data. In this work, we present Distributional Deep Equilibrium Models (DDEQs), extending DEQs to discrete measure inputs, such as sets or point clouds. We provide a th… ▽ More Deep Equilibrium Models (DEQs) are a class of implicit neural networks that solve for a fixed point of a neural network in their forward pass. Traditionally, DEQs take sequences as inputs, but have since been applied to a variety of data. In this work, we present Distributional Deep Equilibrium Models (DDEQs), extending DEQs to discrete measure inputs, such as sets or point clouds. We provide a theoretically grounded framework for DDEQs. Leveraging Wasserstein gradient flows, we show how the forward pass of the DEQ can be adapted to find fixed points of discrete measures under permutation-invariance, and derive adequate network architectures for DDEQs. In experiments, we show that they can compete with state-of-the-art models in tasks such as point cloud classification and point cloud completion, while being significantly more parameter-efficient. △ Less

Submitted 22 March, 2025; v1 submitted 2 March, 2025; originally announced March 2025.

Comments: 39 pages, 17 figures. To be published in AISTATS 2025

arXiv:2307.11224 [pdf, other]

Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models

Authors: Michael Günther, Louis Milliken, Jonathan Geuter, Georgios Mastrapas, Bo Wang, Han Xiao

Abstract: Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating textual inputs into numerical representations, capturing the semantics of the text. These models excel in applications like dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet dataset… ▽ More Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating textual inputs into numerical representations, capturing the semantics of the text. These models excel in applications like dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, offers in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). Furthermore, to increase the model's awareness of grammatical negation, we construct a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community. △ Less

Submitted 20 October, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

Comments: 9 pages, 2 page appendix

MSC Class: 68T50 ACM Class: H.3.1; H.3.3; I.2.7; I.5.4

arXiv:2212.00133 [pdf, ps, other]

Universal Neural Optimal Transport

Authors: Jonathan Geuter, Gregor Kornhardt, Ingimar Tomasson, Vaios Laschos

Abstract: Optimal Transport (OT) problems are a cornerstone of many applications, but solving them is computationally expensive. To address this problem, we propose UNOT (Universal Neural Optimal Transport), a novel framework capable of accurately predicting (entropic) OT distances and plans between discrete measures for a given cost function. UNOT builds on Fourier Neural Operators, a universal class of ne… ▽ More Optimal Transport (OT) problems are a cornerstone of many applications, but solving them is computationally expensive. To address this problem, we propose UNOT (Universal Neural Optimal Transport), a novel framework capable of accurately predicting (entropic) OT distances and plans between discrete measures for a given cost function. UNOT builds on Fourier Neural Operators, a universal class of neural networks that map between function spaces and that are discretization-invariant, which enables our network to process measures of variable resolutions. The network is trained adversarially using a second, generating network and a self-supervised bootstrapping loss. We ground UNOT in an extensive theoretical framework. Through experiments on Euclidean and non-Euclidean domains, we show that our network not only accurately predicts OT distances and plans across a wide range of datasets, but also captures the geometry of the Wasserstein space correctly. Furthermore, we show that our network can be used as a state-of-the-art initialization for the Sinkhorn algorithm with speedups of up to $7.4\times$, significantly outperforming existing approaches. △ Less

Submitted 12 June, 2025; v1 submitted 30 November, 2022; originally announced December 2022.

Comments: 37 pages, 19 figures, accepted to ICML 2025

MSC Class: 68T07 (Primary) 90C08 (Secondary) ACM Class: I.2.6; G.3; G.4

Showing 1–5 of 5 results for author: Geuter, J