Byte-Pair Encoding For Beginners
In this article, we look at one of the most well-known tokenization algorithms, Byte-Pair Encoding (BPE). It is used in many state-of-the-art large language models, including the GPT family, RoBERTa, and BART.
Let's get started.
Byte-Pair Encoding (BPE)
Byte-Pair Encoding (BPE) is a corpus-based subword tokenization algorithm. It is corpus-based because it uses a training corpus to learn which characters (or symbols) frequently occur together and merges them into a single symbol. It is a subword tokenizer because it breaks text into units smaller than (or equal to) words.
The image below shows subword tokenization of the sentence "it is raining". Note that while "it" and "is" are full-word tokens, "rain" and "ing" are subwords of "raining".

The BPE algorithm has two main parts: a token learner and a token segmenter.
1- Token learner: this takes a corpus of text and creates a vocabulary containing tokens. This corpus acts as the training corpus.

2- Token segmenter: this takes a piece of text, such as a sentence, and segments it into tokens. This text is the test data. We use what we learned in the previous step to tokenize the test data in this step (a minimal interface sketch follows below).
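To make the two parts concrete, here is a minimal interface sketch in Python. The function names learn_bpe and segment are hypothetical placeholders, not part of any library; both parts are fleshed out in the code sketches later in the article.

```python
# Hypothetical interfaces for the two parts of BPE (names are placeholders).

def learn_bpe(corpus, num_merges):
    """Token learner: scan the training corpus and return an ordered list
    of merge rules, e.g. [("e", "r"), ("w", "er"), ...]."""
    ...


def segment(text, merges):
    """Token segmenter: split the text into characters, then apply each
    learned merge, in order, to produce the final tokens."""
    ...
```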

It is worth mentioning that
"Byte Pair Encoding (BPE) (Gage, 1994) is an old data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single unused byte."[1]
The BPE algorithm we use for tokenization today adapts this idea, but instead of merging frequent pairs of bytes, it merges frequent characters (or character sequences).
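As a rough illustration of the original compression idea (not the tokenizer), here is a minimal sketch of one Gage-style replacement round; it is my own simplification, not Gage's original code.

```python
from collections import Counter

def compress_once(data: bytes):
    """One simplified round of byte-pair compression: replace the most
    frequent adjacent byte pair with a single unused byte value."""
    pairs = Counter(zip(data, data[1:]))                    # count adjacent byte pairs
    if not pairs:
        return data, {}
    (a, b), _ = pairs.most_common(1)[0]                     # most frequent pair
    unused = next(v for v in range(256) if v not in data)   # assumes some byte value is unused
    out = bytearray()
    i = 0
    while i < len(data):
        if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
            out.append(unused)                              # replace the pair with one byte
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out), {unused: (a, b)}                     # table needed to undo the replacement

compressed, table = compress_once(b"abababcd")              # "ab" is the most frequent pair
```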
The complete pseudocode of the algorithm is below. Let's walk through it step by step.

The first step is highlighted in a red box in the pseudocode: we create a vocabulary from our corpus of documents by splitting every word into characters and adding each character, with its frequency of occurrence, to the vocabulary.
We see that our corpus of documents contains [low, lower, newest, widest], each with a different frequency. We use the symbol "_" to indicate the end of a word.
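A minimal sketch of this first step in Python is below. The word frequencies are placeholders borrowed from the example in [1], since the actual counts come from the figure.

```python
from collections import Counter

# Step 1: split every word into characters, append the end-of-word marker "_",
# and record each resulting symbol sequence with its frequency.
word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}  # placeholder counts

vocab = Counter()
for word, freq in word_counts.items():
    symbols = " ".join(list(word) + ["_"])   # e.g. "l o w _"
    vocab[symbols] += freq

# vocab now maps "l o w _" -> 5, "l o w e r _" -> 2, and so on.
```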

In the second step, we repeat the following num_merges = 10 times, calling three functions consecutively:
- get_stats(): returns a dictionary (called pairs) mapping every adjacent symbol pair to its frequency of occurrence in the vocabulary.
- max(): takes the pairs returned by get_stats() and returns the pair with the maximum occurrence.
- merge_vocab(): treats the chosen pair as one unit and replaces every occurrence of that pair with a single merged symbol. It returns the updated vocabulary.
Together, these three functions find the most frequent symbol pair in the corpus, merge it into one symbol, and update the vocabulary. We repeat this process num_merges = 10 times. A Python version along these lines is sketched below.
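Here is a Python sketch of the learner loop, following the style of the reference implementation in [1]. The word frequencies are again placeholders, so the merges it learns will not exactly match the counts in the walkthrough below.

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the two adjacent symbols in `pair` with one merged symbol."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Placeholder corpus representation: symbol-separated words mapped to frequency.
vocab = {"l o w _": 5, "l o w e r _": 2, "n e w e s t _": 6, "w i d e s t _": 3}

num_merges = 10
merges = []                              # learned merge rules, in order
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)     # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    merges.append(best)
```

After the loop, merges holds the ordered list of learned merge rules, and vocab holds the corpus representation with all merges applied.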

Let's work through the example above. Our corpus is as follows:

We build our vocabulary by splitting all words into characters. Before splitting, we append "_" to the end of each word to mark the end of the word. The pseudocode uses its own end-of-word symbol; here we use "_".

Next, we count the frequency of each character in the corpus. To show the corpus together with these character frequencies, we represent it as follows:

Note "l o w – 5" mean each character of l, o, w and — are repeated 5 times in the context of the word "low-".
Next, we ask: which two characters appear next to each other most often? We see that "e" and "r" are adjacent 8 times, which is the most. We merge them into one symbol, "er", add it to the vocabulary, and update the corpus representation with the new symbol.

We repeat the process: which two symbols appear next to each other most often? We see that "w" and "er" are adjacent 8 times, so we merge them into one symbol, "wer".

We repeat the process: the next two symbols that occur together most often are "wer" and "_". We merge them into "wer_" and update the corpus:

The next two symbols that occur together most frequently are "l" and "o", with 7 occurrences. We merge them into "lo" and update the vocabulary.

So far, we have done 4 merges. We continue merging symbols until we reach 10 merges. After that, our vocabulary is ready, and we can use the BPE segmenter to tokenize test data. This is how the segmenter algorithm works:
BPE segmenter algorithm: run each merge learned from the training data on the test data. These learned merges are the symbols we added to the vocabulary earlier. Apply them to the test data one by one, in the order they were learned from the training data.
Note that the frequency of symbol pairs does not matter here, and no additional learning is needed: we simply scan the test data and apply the learned merges to split the sequence.
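Here is a minimal sketch of the segmenter, assuming merges is the ordered list of merge rules produced by the learner sketch above; the helper names are my own.

```python
def segment_word(word, merges):
    """Apply the learned merges, in order, to a single word from the test data."""
    symbols = list(word) + ["_"]              # start from characters plus the end-of-word marker
    for a, b in merges:                       # replay merges in the order they were learned
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]    # merge the adjacent pair in place
            else:
                i += 1
    return symbols

def segment(text, merges):
    """Tokenize a whitespace-split piece of test data using the learned merges."""
    tokens = []
    for word in text.split():
        tokens.extend(segment_word(word, merges))
    return tokens

# With the merges from the walkthrough above, [("e", "r"), ("w", "er"), ("wer", "_"), ("l", "o")],
# segment_word("lower", merges) returns ["lo", "wer_"].
```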
Properties of BPE Tokens
The way the BPE tokenizer works leads to tokens that are often frequent words and frequent subwords. Frequent subwords are often morphemes, like -est and -er.
A morpheme is the smallest unit of language that carries meaning. Morphemes give clues about the meaning and grammar of a word.
It is important to note that the vocabulary size is configurable: it is controlled by how many merge operations are performed. Fewer merges give a smaller vocabulary, so words are split into more, shorter subword units; more merges give a larger vocabulary with longer, more word-like tokens.
Summary
In conclusion, the Byte-Pair Encoding (BPE) algorithm plays a pivotal role in modern NLP by efficiently breaking down text into subword units. Its token learner builds a subword vocabulary, while the token segmenter uses this vocabulary for text tokenization. Despite some drawbacks, BPE has proven to be highly effective in various NLP tasks, and it continues to be a fundamental component in many current large language models like GPT-3, RoBERTa, and BART.
If you have any questions or suggestions, feel free to reach out to me: Email: [email protected] LinkedIn: https://www.linkedin.com/in/minaghashami/
References
[1] Sennrich, R., Haddow, B., and Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units.
[2] Schuster, M. and Nakajima, K. (2012). Japanese and Korean Voice Search.
[3] Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.
[4] https://smltar.com/tokenization.html
[5] Kudo, T. and Richardson, J. (2018). SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing.