BPE Tokenizer Implementation
LLM
2025-10-22
Byte-Pair Encoding, or BPE, is one of the most widely used tokenizer families in modern language models. In this note, I walk through the core idea of BPE and the basic structure of a simple implementation.
Subword tokenization and the BPE idea
Deep learning models operate on tensors, not raw text. So before a model can process language, we need to turn text into integer sequences. This is the purpose of tokenization.
Before subword methods became dominant, two common strategies were:
- word-level tokenization: split on spaces; simple, but the vocabulary grows without bound and unknown words become a problem;
- character-level tokenization: split into characters; robust, but the sequence becomes too long and the semantic unit is often too small.
BPE finds a useful middle ground. It starts from basic symbols and repeatedly merges the most frequent adjacent pair. In this way it learns useful subword units from data.
That gives BPE several advantages:
- it controls vocabulary size;
- it balances vocabulary size and sequence length;
- it handles rare words and unseen words much more gracefully.
The core idea of the algorithm
The algorithm is simple:
- start from the smallest symbols;
- count adjacent symbol pairs;
- merge the most frequent pair;
- repeat until the target vocabulary size is reached.
A toy example
Suppose our tiny corpus is:
```
low low lower newest newest widest widest widest
```

We begin with symbols such as:

```
l, o, w, e, r, n, s, t, i, d, </w>
```

Then we pre-tokenize the corpus into words and count their frequencies. After that we repeatedly merge frequent pairs:
- merge `("s", "t")` into `"st"`
- merge `("e", "st")` into `"est"`
- merge `("o", "w")` into `"ow"`
and so on.
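One counting step on this toy corpus can be sketched as follows. Note that on this corpus several pairs tie for the highest count (5), so which pair merges first depends on the tie-breaking rule; the word frequencies below are taken directly from the corpus above.

```python
from collections import Counter

# toy corpus, pre-tokenized into words (as symbol tuples) with counts
word_freqs = {
    ("l", "o", "w", "</w>"): 2,
    ("l", "o", "w", "e", "r", "</w>"): 1,
    ("n", "e", "w", "e", "s", "t", "</w>"): 2,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

def count_pairs(word_freqs):
    # count every adjacent symbol pair, weighted by word frequency
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

pairs = count_pairs(word_freqs)
print(pairs[("s", "t")])  # ("s", "t") appears in "newest" (x2) and "widest" (x3)
```

Here `("s", "t")`, `("e", "s")`, and `("t", "</w>")` all reach count 5, which is why the merge order in the example above is only one valid choice.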
Implementation structure
In practice, a BPE tokenizer usually has three stages:
- training
- encoding
- decoding
1. Pre-tokenization
Before the BPE merge loop, it is common to use a regex-based pre-tokenizer. A widely used example is the GPT-2 style pattern:
```python
PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
```

Note that `\p{L}` and `\p{N}` are Unicode property classes; the standard library `re` module does not support them, so this pattern is typically used with the third-party `regex` package.

2. Training
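To get a feel for how this pre-tokenizer splits text, here is a rough illustration using a simplified, ASCII-only variant of the pattern, so it runs with the standard library `re` module (which lacks `\p{L}`):

```python
import re

# simplified ASCII approximation of the GPT-2 pattern above
SIMPLE_PAT = r"'(?:[sdmt]|ll|ve|re)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"

text = "lower, it's 2 tokens"
pieces = re.findall(SIMPLE_PAT, text)
print(pieces)  # → ['lower', ',', ' it', "'s", ' 2', ' tokens']
```

Each piece keeps its leading space, so word boundaries survive into the BPE stage; contractions like `'s` are split off as their own pre-tokens.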
The training loop can be organized like this:
```python
import regex as re  # the GPT-2 pattern uses \p{L}/\p{N}, which need the third-party 'regex' package
from collections import Counter, defaultdict

def train_bpe_naive(text: str, num_merges: int):
    # start from all 256 byte values, plus one special token
    vocab = {i: bytes([i]) for i in range(256)}
    vocab[256] = "<|endoftext|>".encode("utf-8")
    merges = []

    # pre-tokenize and count word frequencies
    freq_table = Counter(m.group() for m in re.finditer(PAT, text))
    # represent each pre-token as a tuple of single-byte tokens
    freq_table_tuple = {
        tuple(bytes([x]) for x in key.encode("utf-8")): value
        for key, value in freq_table.items()
    }

    for i in range(num_merges):
        # count adjacent pair frequencies across all pre-tokens
        pair_stats = defaultdict(int)
        for key, value in freq_table_tuple.items():
            for j in range(len(key) - 1):
                pair_stats[(key[j], key[j + 1])] += value
        # pick the most frequent pair, breaking ties by the lexicographically greater pair
        best_pair = max(pair_stats, key=lambda k: (pair_stats[k], k))
        vocab[257 + i] = b"".join(best_pair)
        # merge_pair_in_table (defined separately) rewrites every token tuple
        # with best_pair collapsed into a single token
        freq_table_tuple = merge_pair_in_table(freq_table_tuple, best_pair)
        merges.append(best_pair)

    return vocab, merges
```

The main steps are:
- initialize the vocabulary with all byte values;
- pre-tokenize the corpus;
- count adjacent pair frequencies;
- merge the best pair;
- update the vocabulary and the tokenized frequency table.
3. Encoding and decoding
Once training is done, encoding applies the learned merges in the learned order. Decoding maps token ids back to bytes and then decodes them into text.
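A naive encoder and decoder along these lines can be sketched as follows. This is a simplification, assuming each pre-token is already a byte string and applying the full `merges` list in training order rather than the rank-based lookup real implementations use for speed:

```python
def encode_word(word: bytes, merges: list) -> list:
    # start from single bytes, then apply merges in the learned order
    tokens = [bytes([b]) for b in word]
    for pair in merges:
        merged = b"".join(pair)
        out = []
        j = 0
        while j < len(tokens):
            if j < len(tokens) - 1 and (tokens[j], tokens[j + 1]) == pair:
                out.append(merged)
                j += 2
            else:
                out.append(tokens[j])
                j += 1
        tokens = out
    return tokens

def decode(token_bytes: list) -> str:
    # concatenate token byte strings and decode them as UTF-8
    return b"".join(token_bytes).decode("utf-8", errors="replace")
```

With the merges `[(b"s", b"t"), (b"e", b"st")]` from the toy example, encoding `b"newest"` yields `[b"n", b"e", b"w", b"est"]`, and decoding those tokens recovers the original string.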
Final remarks
The main point of BPE is not that the algorithm is complicated. It is the opposite:
BPE is powerful precisely because it turns tokenization into a simple data-driven compression-like process.
