Understanding Byte Pair Encoding: Part 4: Nuances

Author

Mark Cassar

Published

January 11, 2025

Modified

January 12, 2025

With the basics of the byte pair encoding (BPE) algorithm sorted out in my last post, I want to delve into some of the nuances of its application for GPT2.

Large language models (LLMs) operate at the token level. The set of unique tokens that the language model has available to it forms the vocabulary. And this vocabulary influences how the model understands language, as well as determining what it is able to predict, since a model like GPT2 generates output by predicting one token at a time.

Every language model also has a fixed size for its context window, meaning it can only handle a set number of tokens as input at a time. This creates a tension between condensing as much text as possible into each token and preserving the model's ability to handle new text, text that wasn't seen during training. A further constraint is that the tokens should carry some meaning that is relevant to human language.
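As a rough back-of-the-envelope illustration: GPT2's context window is 1024 tokens, so how much text fits depends entirely on how much text each token packs in (the 4-characters-per-token average below is my own assumption, not a measured figure).

context_window = 1024            # GPT2's context size, measured in tokens

chars_per_token_char_level = 1   # character-level tokenization: one character per token
chars_per_token_bpe = 4          # assumed rough average for a BPE tokenizer on English text

print(context_window * chars_per_token_char_level)  # about 1024 characters fit in the window
print(context_window * chars_per_token_bpe)         # about 4096 characters fit in the window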

In Part 2 of this series, I noted that BPE is a smart compromise between character-level and word-level tokenization methods. What I mean by this is that, by merging frequently occurring pairs of bytes, BPE achieves a useful amount of compression when moving from text to tokens while still being able to handle rare and out-of-vocabulary (OOV) terms. However, the BPE algorithm does not have any sense of the meaning of language built into the process.
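Here is a quick toy comparison of the two extremes BPE sits between (my own illustration; the sentence and the <unk> convention are made up for the example):

text = "the dinosaur chased the dog"

char_tokens = list(text)     # tiny vocabulary, but very long sequences
word_tokens = text.split()   # short sequences, but the vocabulary must hold every word

print(len(char_tokens))      # 27 tokens for a 5-word sentence
print(len(word_tokens))      # 5 tokens

# A word-level vocabulary built from training text has no entry for unseen words:
vocab = {word: idx for idx, word in enumerate(word_tokens)}
print(vocab.get("dingo", "<unk>"))   # out-of-vocabulary word falls back to an unknown token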

But, what could go wrong?

To answer that, I'll take a look at what happens in a couple of scenarios. The example texts I use are admittedly a bit contrived, but with a large corpus of training data, both cases are sure to arise numerous times.

I'll reuse two helper functions: one to count adjacent pairs and one to merge a chosen pair.

def get_stats(chars):
    # Count how often each adjacent pair occurs in the sequence
    stats = {}
    for i in range(len(chars) - 1):
        pair = (chars[i], chars[i + 1])
        stats[pair] = stats.get(pair, 0) + 1
    # Sort by frequency (most frequent first) and keep only pairs that occur more than once
    stats = dict(sorted(stats.items(), key=lambda item: item[1], reverse=True))
    return {k: v for k, v in stats.items() if v > 1}

def replace_pairs(text, pair):
    # Replace every occurrence of the given adjacent pair with a single merged token
    new_text = []
    i = 0
    while i < len(text):
        if i < len(text) - 1 and text[i] == pair[0] and text[i + 1] == pair[1]:
            new_text.append(pair[0] + pair[1])
            i += 2  # skip past both elements of the merged pair
        else:
            new_text.append(text[i])
            i += 1
    return new_text

The first example considers the merging of text and punctuation:

text = "my dog. your dog."
tokens = list(text)
print(f"Starting tokens: {tokens}")
print()

num_merges = 4
merges = {}

for _ in range(num_merges):
    stats = get_stats(tokens)
    pair = max(stats, key=stats.get)
    print(f"Merging pair: {pair}")
    tokens = replace_pairs(tokens, pair)
    print(f"--tokens after merging: {tokens}")
    print()
Starting tokens: ['m', 'y', ' ', 'd', 'o', 'g', '.', ' ', 'y', 'o', 'u', 'r', ' ', 'd', 'o', 'g', '.']

Merging pair: (' ', 'd')
--tokens after merging: ['m', 'y', ' d', 'o', 'g', '.', ' ', 'y', 'o', 'u', 'r', ' d', 'o', 'g', '.']

Merging pair: (' d', 'o')
--tokens after merging: ['m', 'y', ' do', 'g', '.', ' ', 'y', 'o', 'u', 'r', ' do', 'g', '.']

Merging pair: (' do', 'g')
--tokens after merging: ['m', 'y', ' dog', '.', ' ', 'y', 'o', 'u', 'r', ' dog', '.']

Merging pair: (' dog', '.')
--tokens after merging: ['m', 'y', ' dog.', ' ', 'y', 'o', 'u', 'r', ' dog.']

And the second example considers merges across word boundaries:

text = "one dog. one dinosaur. one dingo."
tokens = list(text)
print(f"Starting tokens: {tokens}")
print()

num_merges = 4
merges = {}

for _ in range(num_merges):
    stats = get_stats(tokens)
    pair = max(stats, key=stats.get)
    print(f"Merging pair: {pair}")
    tokens = replace_pairs(tokens, pair)
    print(f"--tokens after merging: {tokens}")
    print()
Starting tokens: ['o', 'n', 'e', ' ', 'd', 'o', 'g', '.', ' ', 'o', 'n', 'e', ' ', 'd', 'i', 'n', 'o', 's', 'a', 'u', 'r', '.', ' ', 'o', 'n', 'e', ' ', 'd', 'i', 'n', 'g', 'o', '.']

Merging pair: ('o', 'n')
--tokens after merging: ['on', 'e', ' ', 'd', 'o', 'g', '.', ' ', 'on', 'e', ' ', 'd', 'i', 'n', 'o', 's', 'a', 'u', 'r', '.', ' ', 'on', 'e', ' ', 'd', 'i', 'n', 'g', 'o', '.']

Merging pair: ('on', 'e')
--tokens after merging: ['one', ' ', 'd', 'o', 'g', '.', ' ', 'one', ' ', 'd', 'i', 'n', 'o', 's', 'a', 'u', 'r', '.', ' ', 'one', ' ', 'd', 'i', 'n', 'g', 'o', '.']

Merging pair: ('one', ' ')
--tokens after merging: ['one ', 'd', 'o', 'g', '.', ' ', 'one ', 'd', 'i', 'n', 'o', 's', 'a', 'u', 'r', '.', ' ', 'one ', 'd', 'i', 'n', 'g', 'o', '.']

Merging pair: ('one ', 'd')
--tokens after merging: ['one d', 'o', 'g', '.', ' ', 'one d', 'i', 'n', 'o', 's', 'a', 'u', 'r', '.', ' ', 'one d', 'i', 'n', 'g', 'o', '.']

I end up with the token dog. in the first example, and the token one d in the second. Neither is ideal.

[Image: Pearl & Grover]

In the former case, dog carries semantic meaning, while the . is syntax. With a lot of text I would end up with tokens like dog!, dog?, dog;, and so on. This is not how language works: dog has the same meaning regardless of the surrounding punctuation. It would be better for the algorithm to learn just the token dog and leave punctuation marks as tokens of their own.

In the latter case, I end up with a token that crosses a word boundary. Again, this is not ideal, since the word one has meaning regardless of whether it is followed by a word starting with the letter d (the fact that its meaning can be altered by the preceding or following word is handled by the attention mechanism in the transformer architecture, not by tokenization). This, too, is not how language works.

Of course, neither example would be problematic if only a handful of tokens that break the rules of language in this way made it into the vocabulary; given a large training corpus, however, these kinds of problems would crop up all the time.

To overcome these issues, as noted in the GPT2 paper, the authors prevented merging across “character categories”. To implement these categories, they used a regular expression. Applying this approach to my two examples gives:

import regex as re

categories = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

example_1 = re.findall(categories, "my dog. your dog.")
example_2 = re.findall(categories, "one dog. one dinosaur. one dingo.")

print(example_1)
print(example_2)
['my', ' dog', '.', ' your', ' dog', '.']
['one', ' dog', '.', ' one', ' dinosaur', '.', ' one', ' dingo', '.']

Now the process is to apply the BPE algorithm to each element of the list independently of all the others. In this way, merges can only occur within the defined categories, never across them.
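To see this in action, here is a minimal sketch (my own illustration, not OpenAI's actual training code) that repeats the merge loop from the second example but counts pairs only within each regex chunk, reusing replace_pairs and the categories pattern from above:

chunks = [list(chunk) for chunk in re.findall(categories, "one dog. one dinosaur. one dingo.")]

for _ in range(4):
    # Count adjacent pairs, but only pairs that sit inside a single chunk
    stats = {}
    for chunk in chunks:
        for i in range(len(chunk) - 1):
            pair = (chunk[i], chunk[i + 1])
            stats[pair] = stats.get(pair, 0) + 1
    best = max(stats, key=stats.get)
    print(f"Merging pair: {best}")
    # Apply the merge to every chunk independently
    chunks = [replace_pairs(chunk, best) for chunk in chunks]

print(chunks)

The merges now stay inside chunks, so a token like one d can never form: one and the d that follows it live in different chunks.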

Before finishing this topic, I want to summarize how this process works:

  • the BPE tokenizer is trained separately from the large language model (LLM), on its own training data set
  • an initial vocabulary of 256 tokens (byte values from 0 to 255) starts the process
  • merges are done at the byte level and always for the most frequently occurring pair of adjacent bytes
  • merges do not occur across character category boundaries
  • merging stops when the vocabulary reaches a predetermined number of tokens (this is a hyperparameter)
  • after training there is:
    • a vocabulary: basically a dictionary of tokens and their unique IDs
    • a list of merge pairs: byte pairs learned during training, ranked in the order they were learned (earlier merges take priority)
  • the algorithm, vocabulary, and merge list are required at inference to tokenize text for input into an LLM for prediction (a minimal sketch of this step follows the list)
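To make the last two points concrete, here is a minimal sketch of how the pieces could fit together at inference time; the merges dictionary, its rank values, and the encode_chunk helper are my own hypothetical illustration (not GPT2's real merge list or code), and it reuses replace_pairs from above:

# Hypothetical merge list: pairs ranked by the order in which they were learned
merges = {("o", "n"): 0, ("on", "e"): 1, (" ", "d"): 2, (" d", "o"): 3, (" do", "g"): 4}

def encode_chunk(chunk):
    # Start from individual characters (GPT2 actually starts from raw bytes)
    tokens = list(chunk)
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break  # no learned merge applies any more
        tokens = replace_pairs(tokens, best)
    # Each resulting token string would then be looked up in the vocabulary for its ID
    return tokens

print(encode_chunk(" dog"))    # [' dog']
print(encode_chunk(" dingo"))  # [' d', 'i', 'n', 'g', 'o']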

While the training code for the GPT2 tokenizer is not available, the code for tokenizing new text for input into the model is.
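For example, one convenient way to load that released tokenizer is through OpenAI's tiktoken package (assuming it is installed):

import tiktoken

enc = tiktoken.get_encoding("gpt2")    # loads the trained GPT2 vocabulary and merge ranks

print(enc.n_vocab)                     # 50257 tokens in the GPT2 vocabulary
ids = enc.encode("my dog. your dog.")
print(ids)                             # the token IDs the model would actually see
print([enc.decode([i]) for i in ids])  # the piece of text each individual token covers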

And that is all for this topic!

Support

If you enjoy this blog and would like to support my work, you can buy me a cup of coffee!