Byte-Pair Encoding
Byte-Pair Encoding (BPE) is a subword tokenization algorithm: it iteratively merges the most frequent adjacent character or byte pairs in a corpus until a predefined vocabulary size is reached.
BPE originated in 1994 as a simple data compression technique, but its primary role now is efficient tokenization for neural language models. The algorithm starts with a base vocabulary—typically all single characters or the 256 possible byte values—then greedily and iteratively merges the most frequent adjacent pair of symbols into a new, single subword token. This process repeats until the target vocabulary size, often 50,000 to 100,000 tokens, is reached. This subword approach is crucial: it allows models like GPT, RoBERTa, and BART to handle rare or unknown words (out-of-vocabulary, or OOV, terms) by breaking them down into known subword units, striking a balance between a small character-level vocabulary and a massive word-level one.
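The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any library's actual implementation: the toy corpus, the `</w>` end-of-word marker, and the function names are assumptions made for the example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a list of words."""
    word_freqs = Counter(corpus)
    # Base vocabulary: single characters, plus an end-of-word marker
    # (a common convention, assumed here for illustration).
    words = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(words, best)
        merges.append(best)
    return merges

# Toy corpus (hypothetical): frequent words share subwords like "low" and "est".
corpus = ["low", "low", "lower", "newest", "newest", "widest"]
rules = train_bpe(corpus, 5)
```

Each learned rule is a pair of symbols; applying the rules in order to a new word reproduces the same subword segmentation, which is how rare words decompose into known units at inference time.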