Juan Gastaldi
2023
A Formal Perspective on Byte-Pair Encoding
Vilém Zouhar

Clara Meister

Juan Gastaldi

Li Du

Tim Vieira

Mrinmaya Sachan

Ryan Cotterell
Findings of the Association for Computational Linguistics: ACL 2023
Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a 1/σ(1 − e^(−σ))-approximation of an optimal merge sequence, where σ is the total backward curvature with respect to the optimal merge sequence. Empirically, the lower bound of the approximation is ≈ 0.37. We provide a faster implementation of BPE which improves the runtime complexity from O(NM) to O(N log M), where N is the sequence length and M is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
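The greedy procedure the abstract refers to repeatedly picks the most frequent adjacent symbol pair and merges it. The following is a minimal illustrative sketch of that loop (not the paper's O(N log M) implementation, and function names here are my own):

```python
from collections import Counter

def bpe_merge_step(seq):
    # Count all adjacent symbol pairs in the current sequence.
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return seq, None
    best = max(pairs, key=pairs.get)  # greedy choice: most frequent pair
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
            merged.append(seq[i] + seq[i + 1])  # apply the merge
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged, best

def greedy_bpe(seq, num_merges):
    # Each iteration scans the whole sequence, giving the naive O(NM) runtime
    # that the paper's priority-queue-based variant improves on.
    merges = []
    for _ in range(num_merges):
        seq, pair = bpe_merge_step(seq)
        if pair is None:
            break
        merges.append(pair)
    return seq, merges
```

For example, `greedy_bpe(list("aaabdaaabac"), 2)` first merges the pair `('a', 'a')` and then `('aa', 'a')`.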
Tokenization and the Noiseless Channel
Vilém Zouhar

Clara Meister

Juan Gastaldi

Li Du

Mrinmaya Sachan

Ryan Cotterell
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Subword tokenization is a key part of most NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to improved downstream model performance over others. We propose that good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model and efficiency can be quantified in information-theoretic terms as the ratio of the Shannon entropy to the maximum entropy of the subword distribution. Nevertheless, an optimal encoding according to Shannon entropy assigns extremely long codes to low-frequency subwords and very short codes to high-frequency subwords. Defining efficiency in terms of Rényi entropy, on the other hand, penalizes distributions with either very high- or very low-frequency subwords. We posit that (1) extremely high-frequency subwords are problematic because their meaning is not distinct and (2) low-frequency subwords may not appear frequently enough for their meaning to be learned properly; encodings that induce unigram distributions with either can harm model performance. In machine translation, we find that, across multiple tokenizers, the Rényi entropy has a very strong correlation with BLEU: 0.82 in comparison to just 0.30 for compressed length.
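The efficiency notion in the abstract is an entropy divided by the maximum entropy log₂ V over a vocabulary of size V. A minimal sketch of that ratio, generalized to Rényi entropy of order α (the function name and the default α are illustrative choices, not fixed by the abstract):

```python
import math

def renyi_efficiency(counts, alpha=2.5):
    # Unigram distribution over subword types, from raw frequency counts.
    total = sum(counts)
    probs = [c / total for c in counts]
    if alpha == 1.0:
        # Rényi entropy reduces to Shannon entropy as alpha -> 1.
        h = -sum(p * math.log2(p) for p in probs if p > 0)
    else:
        h = math.log2(sum(p ** alpha for p in probs)) / (1 - alpha)
    # Normalize by the maximum entropy log2(V) for vocabulary size V.
    return h / math.log2(len(counts))
```

A uniform subword distribution gives efficiency 1.0 for any α, while a distribution dominated by a few very frequent subwords scores lower.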