Byte-Pair Encoding
Byte-Pair Encoding (BPE) is a subword tokenization algorithm: it iteratively merges the most frequent adjacent character or byte pairs in a corpus until a predefined vocabulary size is reached.
BPE originated in 1994 as a simple data compression technique, but its primary role now is efficient tokenization for neural language models. The algorithm starts with a base vocabulary—typically all single characters or the 256 possible byte values—then greedily and iteratively merges the most frequent adjacent pair of symbols into a new, single subword token. This process repeats until the target vocabulary size, often 50,000 to 100,000 tokens, is reached. This subword approach is crucial: it allows models like GPT, RoBERTa, and BART to handle rare or unknown words (out-of-vocabulary, or OOV, terms) by breaking them down into known subword units, striking a balance between a small character-level vocabulary and a massive word-level one.
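The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any library's actual implementation: the toy corpus, the `</w>` end-of-word marker, and the function names are assumptions made for the example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a list of words."""
    word_freqs = Counter(corpus)
    # Base vocabulary: single characters, plus an end-of-word marker
    # (a common convention, assumed here for illustration).
    words = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges

# Toy corpus (hypothetical): frequent words share subwords like "low" and "est".
corpus = ["low", "low", "lower", "newest", "newest", "widest"]
rules = train_bpe(corpus, 5)
```

Each learned rule is a pair of symbols; applying the rules in order to a new word reproduces the same subword segmentation, which is how rare words decompose into known units at inference time.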