Showing 1–6 of 6 results for author: Schmidt, C W

  1. arXiv:2506.15889  [pdf, ps, other]

    cs.CL

    Entropy-Driven Pre-Tokenization for Byte-Pair Encoding

    Authors: Yifan Hu, Frank Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig W. Schmidt, Chris Tanner

    Abstract: Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such as Chinese presents significant challenges, as its frequency-driven merge operation is agnostic to linguistic boundaries. To address this, we propose two entropy…

    Submitted 18 June, 2025; originally announced June 2025.

  2. arXiv:2504.00178  [pdf, other]

    cs.CL cs.AI

    Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier

    Authors: Craig W. Schmidt, Varshini Reddy, Chris Tanner, Yuval Pinter

    Abstract: Pre-tokenization, the initial step in many modern tokenization pipelines, segments text into smaller units called pretokens, typically splitting on whitespace and punctuation. While this process encourages having full, individual words as tokens, it introduces a fundamental limitation in most tokenization algorithms such as Byte Pair Encoding (BPE). Specifically, pre-tokenization causes the distri…

    Submitted 31 March, 2025; originally announced April 2025.

    MSC Class: 68T50 ACM Class: I.2.7

  3. arXiv:2502.20273  [pdf, ps, other]

    cs.CL cs.CE

    How Much is Enough? The Diminishing Returns of Tokenization Training Data

    Authors: Varshini Reddy, Craig W. Schmidt, Yuval Pinter, Chris Tanner

    Abstract: Tokenization, a crucial initial step in natural language processing, is governed by several key parameters, such as the tokenization algorithm, vocabulary size, pre-tokenization strategy, inference strategy, and training data corpus. This paper investigates the impact of an often-overlooked hyperparameter, tokenizer training data size. We train BPE, UnigramLM, and WordPiece tokenizers across vario…

    Submitted 16 June, 2025; v1 submitted 27 February, 2025; originally announced February 2025.

  4. arXiv:2403.01289  [pdf, other]

    cs.CL

    Greed is All You Need: An Evaluation of Tokenizer Inference Methods

    Authors: Omri Uzan, Craig W. Schmidt, Chris Tanner, Yuval Pinter

    Abstract: While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary siz…

    Submitted 31 May, 2024; v1 submitted 2 March, 2024; originally announced March 2024.

    Comments: ACL 2024 (main)

  5. arXiv:2402.18376  [pdf, other]

    cs.CL cs.AI

    Tokenization Is More Than Compression

    Authors: Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

    Abstract: Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of BPE stems from its ability to condense text into a relatively small number of tokens. We test the hypothesis that fewer…

    Submitted 7 October, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

    Comments: EMNLP 2024

    MSC Class: 68T50 ACM Class: I.2.7

  6. arXiv:1902.09875  [pdf, other]

    cs.CL

    Improving a tf-idf weighted document vector embedding

    Authors: Craig W. Schmidt

    Abstract: We examine a number of methods to compute a dense vector embedding for a document in a corpus, given a set of word vectors such as those from word2vec or GloVe. We describe two methods that can improve upon a simple weighted sum, that are optimal in the sense that they maximize a particular weighted cosine similarity measure. We consider several weighting functions, including inverse document f…

    Submitted 26 February, 2019; originally announced February 2019.