Showing 1–1 of 1 results for author: Wolleb, B

  1. arXiv:2306.01393  [pdf, other]

    cs.CL

    Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT

    Authors: Benoist Wolleb, Romain Silvestri, Giorgos Vernikos, Ljiljana Dolamic, Andrei Popescu-Belis

    Abstract: Subword tokenization is the de facto standard for tokenization in neural language models and machine translation systems. Three advantages are frequently cited in favor of subwords: shorter encoding of frequent tokens, compositionality of subwords, and ability to deal with unknown words. As their relative importance is not entirely clear yet, we propose a tokenization approach that enables us to s…

    Submitted 12 January, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: Accepted at EAMT 2023