Understanding Byte Pair Encoding: Part 2: Tokenization

Author

Mark Cassar

Published

December 23, 2024

Modified

January 10, 2025

In my last post, I discussed encoding text, specifically using UTF-8. As I noted there, this encoding uses 1 to 4 bytes to represent all the characters in the Unicode system. This will be the byte part of byte pair encoding (BPE), which was introduced by Sennrich, Haddow, and Birch in a paper entitled Neural Machine Translation of Rare Words with Subword Units in 2015 (although the official publication was in 2016). It was proposed as a solution to the open vocabulary problem in machine translation. The method is an adaptation of a data compression technique developed by Philip Gage back in 1994.
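
To get a feel for the compression idea that BPE adapts, here is a minimal sketch of my own (a toy example, not code from Gage or from Sennrich et al.): count the adjacent pairs of symbols in the data and replace the most frequent pair with a single unused symbol.

from collections import Counter

data = "aaabdaaabac"  # toy string to compress

# count every adjacent pair of symbols
pairs = Counter(zip(data, data[1:]))
most_frequent = max(pairs, key=pairs.get)   # ('a', 'a')

# one round of compression: replace that pair with an unused symbol
compressed = data.replace("".join(most_frequent), "Z")
print(most_frequent, compressed)            # ('a', 'a') ZabdZabac

Repeating this until no pair occurs more than once is, in essence, Gage’s scheme; BPE reuses the same merge idea to build subword units.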

Rare and out-of-vocabulary (OOV) words are a known issue when dealing with language, so the proposal was to tokenize text in a way that allows for subword units. In this way, rare or OOV words could be composed of the subunits. Sennrich et al. showed that this approach gave better results than prior methods.

Now, instead of encodings and bytes, I want to come to grips with the terms vocabulary, out-of-vocabulary, and tokenize: what do they mean, and why do I need them?

I understand that large language models (LLMs) perform mathematical operations on their input to produce their output. So it makes sense that they cannot process raw text directly. But why not just use the numbers that come out of the UTF-8 encoding? The two main reasons against that approach that I can think of are:

  • encodings are about characters, not meaning; for example, ‘dog’ and ‘canine’ have similar meanings but very different encodings, while ‘act’ and ‘cat’ have similar encodings but very different meanings (see the short check just after this list); and
  • in order to process any possible text we would need a number for each of Unicode’s roughly 1.1 million code points, which would drastically increase the computation required (compared to current methods)
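
Here is a quick check of the first point (just a sketch; the numbers are the UTF-8 byte values of each word, as in the last post):

# similar meanings, unrelated bytes; similar bytes, unrelated meanings
for word in ["dog", "canine", "act", "cat"]:
    print(f"{word:<7}{list(word.encode('utf-8'))}")
# dog    [100, 111, 103]
# canine [99, 97, 110, 105, 110, 101]
# act    [97, 99, 116]
# cat    [99, 97, 116]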

Tokenization

To get to a better numeric representation of text, the usual starting point is tokenization. Tokenization is the process of breaking text up into smaller pieces called tokens, where a token is the smallest unit of text that gets treated as a single, indivisible item. Once it’s decided what a token should be, the set of unique tokens forms the vocabulary. The idea then is that, given any raw text, I can represent it as a sequence of tokens from the vocabulary.

The process so far is:

  • define what is meant by a token
  • gather text that can be used as training data
  • break the training data into tokens
  • create a vocabulary from the unique set of tokens
  • break up any new text into tokens from this vocabulary

This is a good time to try it out, so I’ll grab some text that is in the public domain:

import requests

# download the text of James Joyce's 'Dubliners' from Project Gutenberg
url = "https://www.gutenberg.org/cache/epub/2814/pg2814.txt"

response = requests.get(url)
text = response.text
text = text[1069:].replace("\r\n", " ").replace("\n", " ")  # drop the front matter and replace carriage returns / new lines with spaces
par = text[:903]  # grab the first paragraph (the opening of 'The Sisters')
par
'THE SISTERS   There was no hope for him this time: it was the third stroke. Night after night I had passed the house (it was vacation time) and studied the lighted square of window: and night after night I had found it lighted in the same way, faintly and evenly. If he was dead, I thought, I would see the reflection of candles on the darkened blind for I knew that two candles must be set at the head of a corpse. He had often said to me: “I am not long for this world,” and I had thought his words idle. Now I knew they were true. Every night as I gazed up at the window I said softly to myself the word paralysis. It had always sounded strangely in my ears, like the word gnomon in the Euclid and the word simony in the Catechism. But now it sounded to me like the name of some maleficent and sinful being. It filled me with fear, and yet I longed to be nearer to it and to look upon its deadly work'

I am going to ignore punctuation and define a token as a word:

import string

# define a token as a word: strip punctuation, then split on whitespace
punct = string.punctuation

par_no_punct = ''.join(c for c in par if c not in punct)
par_tokens = par_no_punct.split()
print(len(par_tokens))
print(par_tokens)
183
['THE', 'SISTERS', 'There', 'was', 'no', 'hope', 'for', 'him', 'this', 'time', 'it', 'was', 'the', 'third', 'stroke', 'Night', 'after', 'night', 'I', 'had', 'passed', 'the', 'house', 'it', 'was', 'vacation', 'time', 'and', 'studied', 'the', 'lighted', 'square', 'of', 'window', 'and', 'night', 'after', 'night', 'I', 'had', 'found', 'it', 'lighted', 'in', 'the', 'same', 'way', 'faintly', 'and', 'evenly', 'If', 'he', 'was', 'dead', 'I', 'thought', 'I', 'would', 'see', 'the', 'reflection', 'of', 'candles', 'on', 'the', 'darkened', 'blind', 'for', 'I', 'knew', 'that', 'two', 'candles', 'must', 'be', 'set', 'at', 'the', 'head', 'of', 'a', 'corpse', 'He', 'had', 'often', 'said', 'to', 'me', '“I', 'am', 'not', 'long', 'for', 'this', 'world”', 'and', 'I', 'had', 'thought', 'his', 'words', 'idle', 'Now', 'I', 'knew', 'they', 'were', 'true', 'Every', 'night', 'as', 'I', 'gazed', 'up', 'at', 'the', 'window', 'I', 'said', 'softly', 'to', 'myself', 'the', 'word', 'paralysis', 'It', 'had', 'always', 'sounded', 'strangely', 'in', 'my', 'ears', 'like', 'the', 'word', 'gnomon', 'in', 'the', 'Euclid', 'and', 'the', 'word', 'simony', 'in', 'the', 'Catechism', 'But', 'now', 'it', 'sounded', 'to', 'me', 'like', 'the', 'name', 'of', 'some', 'maleficent', 'and', 'sinful', 'being', 'It', 'filled', 'me', 'with', 'fear', 'and', 'yet', 'I', 'longed', 'to', 'be', 'nearer', 'to', 'it', 'and', 'to', 'look', 'upon', 'its', 'deadly', 'work']

Based on defining a token at the word level, I have 183 tokens in my training data, which I took to be the first paragraph of the text. Now I would like to create the vocabulary, which is just the set of unique tokens:

vocab = sorted(set(par_tokens))

print(len(vocab))
print(vocab)
109
['But', 'Catechism', 'Euclid', 'Every', 'He', 'I', 'If', 'It', 'Night', 'Now', 'SISTERS', 'THE', 'There', 'a', 'after', 'always', 'am', 'and', 'as', 'at', 'be', 'being', 'blind', 'candles', 'corpse', 'darkened', 'dead', 'deadly', 'ears', 'evenly', 'faintly', 'fear', 'filled', 'for', 'found', 'gazed', 'gnomon', 'had', 'he', 'head', 'him', 'his', 'hope', 'house', 'idle', 'in', 'it', 'its', 'knew', 'lighted', 'like', 'long', 'longed', 'look', 'maleficent', 'me', 'must', 'my', 'myself', 'name', 'nearer', 'night', 'no', 'not', 'now', 'of', 'often', 'on', 'paralysis', 'passed', 'reflection', 'said', 'same', 'see', 'set', 'simony', 'sinful', 'softly', 'some', 'sounded', 'square', 'strangely', 'stroke', 'studied', 'that', 'the', 'they', 'third', 'this', 'thought', 'time', 'to', 'true', 'two', 'up', 'upon', 'vacation', 'was', 'way', 'were', 'window', 'with', 'word', 'words', 'work', 'world”', 'would', 'yet', '“I']

Using this vocabulary, I can now try to tokenize any new text:

def check_tokens(tokens, vocab):
    # print a check mark for each token found in the vocabulary,
    # and flag any token that is not in it
    for token in tokens:
        if token in vocab:
            print(chr(10004), end=" ")                   # ✔
        else:
            print(f"({chr(10006)} '{token}')", end=" ")  # ✖ with the offending token

new_text = 'I like to be on vacation'
new_text_tokens = new_text.split()

print(f"New text: {new_text}")
print(f"Possible tokenization: {new_text_tokens}")
print()
print("Check if all tokens are in the vocabulary:")
check_tokens(new_text_tokens, vocab)
New text: I like to be on vacation
Possible tokenization: ['I', 'like', 'to', 'be', 'on', 'vacation']

Check if all tokens are in the vocabulary:
✔ ✔ ✔ ✔ ✔ ✔ 

Out-of-vocabulary words

But what if I wanted to tokenize a different sentence?

diff_text = "I hate to be on vacation"
diff_text_tokens = diff_text.split()

print(f"Different text: {diff_text}")
print(f"Possible tokenization: {diff_text_tokens}")
print()
print("Check if all tokens are in the vocabulary:")
check_tokens(diff_text_tokens, vocab)
Different text: I hate to be on vacation
Possible tokenization: ['I', 'hate', 'to', 'be', 'on', 'vacation']

Check if all tokens are in the vocabulary:
✔ (✖ 'hate') ✔ ✔ ✔ ✔ 

I see I have run into the OOV problem! (I am tempted to say that no one hates to be on vacation so there really is no need to tokenize this sentence!) The text I am currently trying to tokenize contains a token (in this case, a word) that does not exist in my vocabulary. What happened?

Language is high-dimensional; that is, there are a lot of words. (And that is true of English alone, never mind all human languages.) My example uses only a small amount of training data, but to avoid this problem at the word level I would need training data containing every word ever produced, which is not practical. Even that would only suffice until a new word was coined, and then I would be facing the OOV problem again. And that is before even asking whether such a vocabulary would be computationally feasible.

Rare words

Another issue arises if I try to make sure the vocabulary size doesn’t get too big. To do this, I can limit the vocabulary to tokens that occur at or above some threshold frequency. To see how this works, I’ll redo what I just did above, but this time I will only keep tokens in my vocabulary that occur more than once.

# count how often each token occurs in the training data
t_freq = {}
for token in par_tokens:
    t_freq[token] = t_freq.get(token, 0) + 1

# keep only the tokens that occur more than once (dict keys are already unique)
tokens_freq = [token for token in t_freq if t_freq[token] > 1]
vocab_freq = sorted(tokens_freq)
print(len(vocab_freq))
print(vocab_freq)
27
['I', 'It', 'after', 'and', 'at', 'be', 'candles', 'for', 'had', 'in', 'it', 'knew', 'lighted', 'like', 'me', 'night', 'of', 'said', 'sounded', 'the', 'this', 'thought', 'time', 'to', 'was', 'window', 'word']

Applying this new vocabulary to the same two sentences gives:

print(f"New text: {new_text}")
print(f"Possible tokenization: {new_text_tokens}")
print()
print("Check if all tokens are in the vocabulary:")
check_tokens(new_text_tokens, vocab_freq)
print()
print()
print(f"Different text: {diff_text}")
print(f"Possible tokenization: {diff_text_tokens}")
print()
print("Check if all tokens are in the vocabulary:")
check_tokens(diff_text_tokens, vocab_freq)
New text: I like to be on vacation
Possible tokenization: ['I', 'like', 'to', 'be', 'on', 'vacation']

Check if all tokens are in the vocabulary:
✔ ✔ ✔ ✔ (✖ 'on') (✖ 'vacation') 

Different text: I hate to be on vacation
Possible tokenization: ['I', 'hate', 'to', 'be', 'on', 'vacation']

Check if all tokens are in the vocabulary:
✔ (✖ 'hate') ✔ ✔ (✖ 'on') (✖ 'vacation') 

I get the anticipated drop in vocabulary size from 109 to 27; however, I also see that now neither of the sentences can be properly tokenized. This is the rare word problem.

One way to deal with this is to add a special token, <|unk|>, to the vocabulary, which can be used whenever I encounter a token that is not in it.

For the current scenario, my two sentences would be tokenized as follows:

new_text_tokens = [token if token in vocab_freq else '<|unk|>' for token in new_text.split()]
diff_text_tokens = [token if token in vocab_freq else '<|unk|>' for token in diff_text.split()]

print(new_text_tokens)
print(diff_text_tokens)
['I', 'like', 'to', 'be', '<|unk|>', '<|unk|>']
['I', '<|unk|>', 'to', 'be', '<|unk|>', '<|unk|>']

This removes both the rare word and OOV problems, but it is not ideal. Now, any word not in my vocabulary gets assigned the exact same token, <|unk|>, and the model will treat them all identically regardless of their original meaning.
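
To see why this matters, here is a minimal sketch (the integer IDs are arbitrary, just an enumeration of my reduced vocabulary): a model ultimately works with token IDs, and every unknown word collapses to the single ID of <|unk|>.

# map every vocabulary token (plus the special token) to an integer ID
token_to_id = {token: i for i, token in enumerate(vocab_freq + ['<|unk|>'])}
unk_id = token_to_id['<|unk|>']

ids_new = [token_to_id.get(t, unk_id) for t in new_text.split()]
ids_diff = [token_to_id.get(t, unk_id) for t in diff_text.split()]
print(ids_new)   # 'on' and 'vacation' both end up as the <|unk|> ID
print(ids_diff)  # so does 'hate': the model sees the same ID for all three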

Go bigger

For current LLMs, computation happens at the token level. As far as I can tell, this means that if a given piece of raw text can be represented by fewer tokens, then it requires less computation; or, if I fix the computational resources, then I can process more raw text at a time (the so-called context window).
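
As a rough illustration of that trade-off (the budget of 8 tokens is an arbitrary number I picked, not a real model's limit):

budget = 8  # hypothetical fixed number of tokens the model can handle at once
sample = "I like to be on vacation and to read"

# with word-level tokens, 8 tokens cover 8 words of raw text
print(" ".join(sample.split()[:budget]))   # I like to be on vacation and to
# with character-level tokens, the same budget covers only 8 characters
print(sample[:budget])                     # I like t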

If that is the case, maybe going bigger than word level would be a good idea. I’ll try sentences:

import re

# split the paragraph into sentences at '.', '?' or '!' while trying not to
# split on abbreviations, then strip punctuation from each sentence
sent = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|!)\s', par)

vocab_sent = []
for s in sent:
    cleaned_s = ''.join(ch for ch in s if ch.isalnum() or ch.isspace())
    vocab_sent.append(cleaned_s)

vocab_sent
['THE SISTERS   There was no hope for him this time it was the third stroke',
 'Night after night I had passed the house it was vacation time and studied the lighted square of window and night after night I had found it lighted in the same way faintly and evenly',
 'If he was dead I thought I would see the reflection of candles on the darkened blind for I knew that two candles must be set at the head of a corpse',
 'He had often said to me I am not long for this world and I had thought his words idle',
 'Now I knew they were true',
 'Every night as I gazed up at the window I said softly to myself the word paralysis',
 'It had always sounded strangely in my ears like the word gnomon in the Euclid and the word simony in the Catechism',
 'But now it sounded to me like the name of some maleficent and sinful being',
 'It filled me with fear and yet I longed to be nearer to it and to look upon its deadly work']

Now, if I try to tokenize new text, I’ll see that it is very unlikely the tokenization will succeed:

print(f"New text: {new_text}")
new_text_tokens = [''.join(ch for ch in new_text if ch.isalnum() or ch.isspace())]
print(f"Possible tokenization: {new_text_tokens}")
print()
print("Check if all tokens are in the vocabulary:")
check_tokens(new_text_tokens, vocab_sent)
print()
print()
print(f"Different text: {diff_text}")
diff_text_tokens = [''.join(ch for ch in diff_text if ch.isalnum() or ch.isspace())]
print(f"Possible tokenization: {diff_text_tokens}")
print()
print("Check if all tokens are in the vocabulary:")
check_tokens(diff_text_tokens, vocab_sent)
New text: I like to be on vacation
Possible tokenization: ['I like to be on vacation']

Check if all tokens are in the vocabulary:
(✖ 'I like to be on vacation') 

Different text: I hate to be on vacation
Possible tokenization: ['I hate to be on vacation']

Check if all tokens are in the vocabulary:
(✖ 'I hate to be on vacation') 

If it were to work, then every sentence of raw text would only be one token, which would be computationally efficient. However, to get it to work, my training data would have to cover all possible sentences, which would make the vocabulary impractically large.

Go smaller

What if, instead of making the tokens longer than words, I made them shorter? An obvious approach then would be to tokenize at the character level:

# a character-level vocabulary: the unique letters, digits, and spaces in the training data
vocab_ch = sorted(set(par))
vocab_ch = [ch for ch in vocab_ch if ch.isalnum() or ch.isspace()]
print(vocab_ch)
[' ', 'B', 'C', 'E', 'H', 'I', 'N', 'R', 'S', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'y', 'z']

Let’s see how this would work on my two sentences:

print(f"New text: {new_text}")
new_text_tokens = list(''.join(ch for ch in new_text if ch.isalnum() or ch.isspace()))
print(f"Possible tokenization: {new_text_tokens}")
print()
print("Check if all tokens are in the vocabulary:")
check_tokens(new_text_tokens, vocab_ch)
print()
print()
print(f"Different text: {diff_text}")
diff_text_tokens = list(''.join(ch for ch in diff_text if ch.isalnum() or ch.isspace()))
print(f"Possible tokenization: {diff_text_tokens}")
print()
print("Check if all tokens are in the vocabulary:")
check_tokens(diff_text_tokens, vocab_ch)
New text: I like to be on vacation
Possible tokenization: ['I', ' ', 'l', 'i', 'k', 'e', ' ', 't', 'o', ' ', 'b', 'e', ' ', 'o', 'n', ' ', 'v', 'a', 'c', 'a', 't', 'i', 'o', 'n']

Check if all tokens are in the vocabulary:
✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ 

Different text: I hate to be on vacation
Possible tokenization: ['I', ' ', 'h', 'a', 't', 'e', ' ', 't', 'o', ' ', 'b', 'e', ' ', 'o', 'n', ' ', 'v', 'a', 'c', 'a', 't', 'i', 'o', 'n']

Check if all tokens are in the vocabulary:
✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ 

It works! And since all English words, by definition, can be constructed from the alphabet, I will never encounter a word that I cannot tokenize.

Before celebrating, I will do some comparisons using the text “I like to be on vacation”.

Tokenization Level | Number of Tokens | Relative Vocabulary Size | OOV/Rare Word Problems
character          | 24               | small                    | no
word               | 6                | medium                   | yes
sentence           | 1                | large                    | yes
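
The “Number of Tokens” column can be checked directly (a quick sketch reusing new_text from above):

print(len(list(new_text)))    # character level: 24 tokens
print(len(new_text.split()))  # word level: 6 tokens
print(len([new_text]))        # sentence level: 1 token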

I conclude that the “no free lunch” axiom holds: removing one problem tends to introduce another somewhere else. For instance, tokenizing at the character level means there is no such thing as an OOV or rare word problem, but the cost is a large increase in the number of tokens for any given text. So, if token count is a rough proxy for computation, then character-level tokenization needs more computation for the same text than, say, word-level tokenization. And if I keep the computation fixed, then character-level tokenization significantly shrinks the context window, that is, how much text the LLM can process at once.

I think of byte pair encoding as a smart compromise between character and word level tokenization. And that is what I’ll finally dig into next time.

Support

If you enjoy this blog and would like to support my work, you can buy me a cup of coffee!