SuperBPE: Space Travel for Language Models

*Alisa Liu♥︎♠︎, *Jonathan Hayase♥︎, Valentin Hofmann♦︎♥︎, Sewoong Oh♥︎, Noah A. Smith♥︎♦︎, Yejin Choi♠︎
♥︎University of Washington, ♠︎NVIDIA, ♦︎Allen Institute for AI, *Equal contribution
TL;DR: We introduce a family of superword tokenizers that encode the same text using up to 33% fewer tokens than a BPE tokenizer of the same size. 8B models trained with our tokenizer are not only more efficient during inference, but also outperform the baseline on a large suite of downstream tasks, including +8.2% on MMLU.

Tokenizers are the interface through which language models interact with the world. Beginning from a blank slate, a language model learns the shape and form of language as sequences of tokens, becoming ever more capable; but in the end, the tokenizer still governs the mapping between text and computation. Today, tokenization universally occurs at the level of subwords, meaning that tokens are parts of words (including complete words) but cannot bridge whitespace. Historically, subword tokenization was meant to combine the strengths of word-level tokenization, which cannot easily handle novel words, and byte-level tokenization, which can represent arbitrary text but is much less efficient.

But for modern language models, does it really make sense to limit tokens to parts of words? Whitespace is not a consistent delimiter of meaning — multi-word expressions (“by the way”) function semantically as single units, and different languages vary in the number of words needed to express a concept (“spacesuit helmet” is “Raumanzughelm” in German). At the extreme, languages such as Chinese do not use whitespace at all. Tokens in these languages span multiple words and even entire sentences, yet this has seemingly not hindered LMs from learning these languages.

We extend tokenization beyond subwords by introducing SuperBPE, an algorithm that produces tokenizers including both subword and “superword” tokens. As background, the subword restriction in BPE is enforced in a step called pretokenization, which splits the training text on whitespace to prevent common word sequences from becoming single tokens. SuperBPE modifies BPE by adding a simple pretokenization curriculum: the tokenizer first learns subwords by using pretokenization, and then lifts this restriction to transition to learning superwords.
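
To make the curriculum concrete, here is a minimal, illustrative sketch of BPE training with a pretokenization curriculum. This is a toy re-implementation for exposition, not our actual training code; the transition point and all names are illustrative, and the toy ignores the leading-space convention of real byte-level BPE tokenizers.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent token pairs across all sequences and return the most common one."""
    counts = Counter()
    for seq in sequences:
        for pair in zip(seq, seq[1:]):
            counts[pair] += 1
    return counts.most_common(1)[0][0] if counts else None

def apply_merge(seq, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_token`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_superbpe(text, num_merges, transition):
    """Learn `num_merges` merges; the first `transition` merges respect word boundaries."""
    # Stage 1: pretokenize on whitespace, so merges cannot bridge words.
    sequences = [list(word) for word in text.split()]
    merges = []
    while len(merges) < num_merges:
        if len(merges) == transition:
            # Stage 2: drop pretokenization. Re-encode the raw text (spaces included)
            # with the merges learned so far; later merges may now cross whitespace.
            seq = list(text)
            for pair, tok in merges:
                seq = apply_merge(seq, pair, tok)
            sequences = [seq]
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        new_token = pair[0] + pair[1]
        sequences = [apply_merge(s, pair, new_token) for s in sequences]
        merges.append((pair, new_token))
    return merges

merges = train_superbpe("by the way the fish oil is by the way stored", num_merges=20, transition=8)
print([tok for _, tok in merges])
```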

[Figure: bytes per token vs. vocabulary size. SuperBPE encodes text more efficiently than both variants of BPE, and the gap grows with vocabulary size.]

We find that SuperBPE dramatically improves encoding efficiency over BPE, meaning that it segments the same piece of text into fewer tokens. This is because BPE tokenizers quickly exhaust the set of “useful” words to add to the vocabulary and begin adding increasingly rare (sub)words, which manifest as “undertrained tokens” like the famous _SolidGoldMagikarp. SuperBPE can instead add common word sequences like _fish_oil to its vocabulary. For instance, at a fixed vocabulary size of 200k, a SuperBPE tokenizer uses 33% fewer tokens than BPE to encode the same text!
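
For reference, encoding efficiency here means bytes of text per token. A quick way to measure it for any Hugging Face tokenizer is sketched below; the repo IDs in the comments are placeholders, not specific released checkpoints.

```python
from transformers import AutoTokenizer

def bytes_per_token(tokenizer_name: str, text: str) -> float:
    """Average number of UTF-8 bytes encoded per token; higher is more efficient."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    num_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    return len(text.encode("utf-8")) / num_tokens

text = "By the way, the fish oil should not be stored in the spacesuit helmet."
# Placeholder repo IDs -- substitute real BPE / SuperBPE tokenizers:
# print(bytes_per_token("path/to/bpe-200k-tokenizer", text))
# print(bytes_per_token("path/to/superbpe-200k-tokenizer", text))
```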

What happens when we train models with superword tokenizers? In our experiments, we pretrain 8B models from scratch, fixing everything about the model architecture and training setup and varying only the algorithm for learning the vocabulary. We find that models trained with SuperBPE tokenizers are consistently better: our best model achieves a +4.0% absolute improvement over the BPE baseline on average across 30 downstream tasks, winning on 25 of the 30 individual tasks (including +8.2% on MMLU), while also being 27% more efficient at inference time.

| Category | Task | BPE | SuperBPE | Δ |
|---|---|---|---|---|
| Knowledge | ARC-Easy (MC) | 46.6 | 67.1 | +20.5 |
| | ARC-Challenge (MC) | 35.1 | 50.6 | +15.5 |
| | Jeopardy (MC) | 42.1 | 41.8 | −0.3 |
| | MMLU (MC) | 36.5 | 44.7 | +8.2 |
| | OpenbookQA (MC) | 33.2 | 54.4 | +21.2 |
| | TriviaQA (EM) | 60.6 | 61.3 | +0.7 |
| | WikidataQA (EM) | 69.7 | 70.9 | +1.2 |
| Math & Reasoning | Arithmetic (EM) | 54.8 | 59.3 | +4.5 |
| | GSM8K (EM) | 6.4 | 6.7 | +0.3 |
| | LSAT-AR (MC) | 21.3 | 23.0 | +1.7 |
| | Operators (EM) | 35.5 | 33.6 | −1.9 |
| | Repeat-Copy-Logic (EM) | 3.1 | 6.2 | +3.1 |
| Coding | HumanEval (pass@10) | 15.9 | 13.4 | −2.5 |
| | MBPP (pass@10) | 27.5 | 28.3 | +0.8 |
| Reading Comprehension | BoolQ (MC) | 59.7 | 64.6 | +4.9 |
| | CoQA (EM) | 12.6 | 13.2 | +0.6 |
| | DROP (EM) | 31.3 | 31.4 | +0.1 |
| | HotpotQA (EM) | 53.5 | 55.2 | +1.7 |
| | SQuAD (EM) | 75.1 | 75.8 | +0.7 |
| Commonsense | CommonsenseQA (MC) | 33.5 | 53.8 | +20.3 |
| | COPA (MC) | 77.0 | 85.8 | +8.8 |
| | PIQA (MC) | 55.2 | 59.8 | +4.6 |
| | Winograd (MC) | 50.4 | 53.1 | +2.7 |
| | Winogrande (MC) | 47.3 | 52.6 | +5.3 |
| Language Understanding | HellaSwag (MC) | 29.7 | 33.7 | +4.0 |
| | LAMBADA (EM) | 77.0 | 70.6 | −6.4 |
| | Language Identification (EM) | 8.8 | 9.0 | +0.2 |
| String Manipulation | CS Algorithms (EM) | 46.1 | 48.6 | +2.5 |
| | CUTE (EM) | 31.3 | 32.6 | +1.3 |
| | Dyck-Languages (EM) | 15.9 | 14.2 | −1.7 |
| Average | | 39.8 | 43.8 | +4.0 |

We also find that SuperBPE distributes difficulty more uniformly over tokens, overfitting less to extremely common, easy-to-predict words while also achieving much lower loss on the hardest tokens. This makes sense from a qualitative linguistic analysis: superword tokens often consist of multi-word expressions (by accident, depend on, of course) that function semantically as single units. The individual words in these expressions are often semantically vacuous and vary little given their context. Under BPE these correspond to extremely low-loss tokens, whereas under SuperBPE they are merged into larger superword tokens.
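
As a rough illustration of the quantity behind the figure below, the sketch here computes per-token losses from a Hugging Face causal LM. The model IDs are placeholders, and the paper additionally normalizes losses so that models producing different token counts are directly comparable.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_losses(model_name: str, text: str) -> torch.Tensor:
    """Return the negative log-likelihood of each token in `text` under the model."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Position i predicts token i+1, so shift before taking per-token cross-entropy.
    return F.cross_entropy(
        logits[:, :-1].flatten(0, 1), ids[:, 1:].flatten(), reduction="none"
    )

# Placeholder model IDs -- compare the two loss distributions (e.g., as histograms):
# bpe_losses = per_token_losses("path/to/bpe-baseline-8b", some_text)
# superbpe_losses = per_token_losses("path/to/superbpe-8b", some_text)
```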

[Figure: distribution of per-token normalized loss. SuperBPE has fewer tokens with very high and very low normalized loss than BPE.]

Together, these findings suggest that SuperBPE provides a better representation of text over which to learn language, giving a remarkable boost to both encoding efficiency and downstream performance. SuperBPE is a drop-in replacement for BPE that requires no other modifications to the model architecture or training framework, making it a compelling alternative that integrates seamlessly with modern language model ecosystems. You can run our models right now using HuggingFace Transformers and vLLM!
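
As a minimal sketch of what that looks like with Hugging Face Transformers (the repo ID below is a placeholder; substitute the actual model name from the release):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/superbpe-8b"  # placeholder repo ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Encode a prompt, generate a continuation, and decode it back to text.
inputs = tokenizer("By the way, a spacesuit helmet is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```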

If you made it this far, we encourage you to read our paper for lots more experiments, discussion, and analysis!

P.S. This blog post was 896 tokens when encoded by BPE, and 666 tokens when encoded by SuperBPE.

BibTeX

@misc{liu-etal-2025-superbpe,
      title={SuperBPE: Space Travel for Language Models},
      author={Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi},
      year={2025},
      eprint={2503.13423},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.13423},
}