July 11, 2025
In the last three months, we have had the joy of working with different groups of people exploring SuperBPE in their own model development pipelines. If you missed it, here is our first post introducing SuperBPE. In the process, we’ve compiled an FAQ with new findings and some practical suggestions to help practitioners use SuperBPE to its fullest potential.
While our original paper studied fairly short-context regimes (~4K token context length), the efficiency benefits of SuperBPE become even more pronounced in the long-context regime. Specifically, for a fixed reduction factor in sequence length, the savings in inference FLOPs grow with the length of the text, because the cost of attention scales quadratically with sequence length.
To see why this is true, recall that inference FLOPs are given by 4 × D × L² + 40 × D² × L (assuming an MLP expansion of 8), where D is the hidden dimension and L is the sequence length.
The first and second terms represent the attention and non-attention FLOPs, respectively.
Now, let α denote the ratio of our tokenizer’s encoding efficiency to that of a normal BPE tokenizer (e.g., α = 1.5 for our most efficient tokenizer), so that the same text takes 1/α as many tokens. If the context length is short, then the non-attention term dominates and inference compute shrinks by a factor close to 1/α. However, in the long-context regime, the attention term dominates, and since it scales with L², the compute shrinks by a factor approaching 1/α²!
To illustrate this, we plot the inference FLOPs used by a BPE and a SuperBPE model to encode the same text, with length measured in bytes. The gap between the blue and orange lines represents the FLOPs savings. For a BPE model with a context size of 4K tokens, switching to SuperBPE gives a 35% reduction in inference FLOPs; for a BPE model with a context size of 128K tokens, that reduction grows to 50%!
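As a rough sanity check, here is a minimal sketch that plugs the formula above into code. The hidden dimension of 4096 (typical of an 8B model) and α = 1.5 are assumptions on our part, not values read off the figure:

```python
# Per-layer inference FLOPs for a transformer with MLP expansion 8:
# 4*D*L^2 for attention plus 40*D^2*L for everything else.
def inference_flops(d_model: float, seq_len: float) -> float:
    return 4 * d_model * seq_len**2 + 40 * d_model**2 * seq_len

D = 4096      # hidden dimension (assumed, roughly an 8B model)
ALPHA = 1.5   # SuperBPE encoding-efficiency ratio

for bpe_ctx in (4_096, 131_072):
    bpe_flops = inference_flops(D, bpe_ctx)
    superbpe_flops = inference_flops(D, bpe_ctx / ALPHA)  # same text, fewer tokens
    print(f"BPE ctx {bpe_ctx:>7,}: FLOPs reduction = {1 - superbpe_flops / bpe_flops:.0%}")
# -> roughly 35% at a 4K-token context and 50% at a 128K-token context
```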
Scaling laws from the Chinchilla paper established that, across a wide range of budgets, the optimal way to allocate a fixed training budget is at roughly 22 training tokens per parameter. However, these experiments used a fixed subword tokenizer. As we begin to think of the tokenizer as another variable we can control, a natural question arises: does the compute-optimal ratio shift when using more efficient tokenizers? Intuitively, we would expect that “harder” (i.e., longer on average) tokens require a more capable (i.e., larger) model to predict well.
As an initial exploration in this direction, we fixed a training budget and performed a sweep over the ratio of tokens seen during training to model parameter count. In the below figure, we plot the (smoothed) endpoint loss of the training curves. All points in the plot use the same training budget, but points on the left represent larger models trained on fewer tokens, while points on the right are smaller models trained on more tokens. The minimum on each curve represents the optimal ratio of tokens (T) to parameters (P).
We see that the optimal T/P ratio for BPE is 22 tokens/parameter as expected, but for SuperBPE it is roughly 30% lower at 15 tokens/parameter — this is suspiciously close to the 30% average reduction in tokens due to SuperBPE’s improved efficiency! This points to the possibility that compute optimality is actually a constant ratio of training bytes to parameters, not training tokens to parameters as commonly perceived. From our experiments, the true compute-optimal ratio seems to be (a very nice) 100 bytes/parameter.
For model developers intending to train compute-optimal models, this would mean making the models bigger and shrinking the number of training tokens to achieve a ratio of 15 training tokens per model parameter. In this setting, the preceding figure suggests the SuperBPE model will achieve lower BPB while still retaining a small inference-time speedup.
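For concreteness, here is a rough allocation sketch that combines the standard C ≈ 6·N·T approximation for training FLOPs with the 100 bytes/parameter heuristic. The training budget and the bytes-per-token figures are illustrative assumptions, chosen so that the resulting ratios land near the ~22 and ~15 tokens/parameter optima above:

```python
import math

# Training FLOPs C ≈ 6 * N * T, plus the hypothesis that compute optimality
# sits at ~100 training bytes per parameter regardless of tokenizer.
def optimal_allocation(compute_budget, bytes_per_token, bytes_per_param=100.0):
    tokens_per_param = bytes_per_param / bytes_per_token
    n_params = math.sqrt(compute_budget / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

C = 1e23  # example training budget in FLOPs (placeholder)
for name, bpt in [("BPE", 4.5), ("SuperBPE", 6.7)]:  # assumed bytes/token
    n, t = optimal_allocation(C, bpt)
    print(f"{name:>8}: ~{n / 1e9:.0f}B params, ~{t / 1e9:.0f}B tokens "
          f"({t / n:.1f} tokens/param)")
# The more efficient tokenizer calls for a bigger model trained on fewer tokens.
```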
In our original experiments, we adjusted the max context size of SuperBPE models (in tokens) to match the effective max context size of the BPE model in raw text (bytes). This is because we wanted to avoid an unfair advantage from SuperBPE seeing more textual context for the same next-token prediction. In new analysis, we support this design choice by showing that the longer the context in bytes (not tokens), the easier the next token is to predict. The following two plots show the average loss at every token index (left) vs byte index (right) — when measured at fixed token indices, SuperBPE has an advantage from seeing more context (achieving lower loss on average at the same token index), whereas at fixed byte indices, this advantage goes away.
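If you want to apply the same matching in your own setup, a minimal sketch looks like the following. The tokenizer paths and text sample are placeholders; the bytes-per-token efficiency is measured empirically on your own data:

```python
from transformers import AutoTokenizer

def bytes_per_token(tokenizer, texts):
    """Empirical encoding efficiency on a sample of your pretraining data."""
    n_bytes = sum(len(t.encode("utf-8")) for t in texts)
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    return n_bytes / n_tokens

# Placeholder paths and sample: substitute your own tokenizers and documents.
bpe = AutoTokenizer.from_pretrained("path/to/bpe-tokenizer")
superbpe = AutoTokenizer.from_pretrained("path/to/superbpe-tokenizer")
sample = ["..."]  # a representative list of raw documents

bpe_ctx = 4096  # the BPE model's max context size in tokens
effective_bytes = bpe_ctx * bytes_per_token(bpe, sample)
superbpe_ctx = round(effective_bytes / bytes_per_token(superbpe, sample))
print(f"Context matched in bytes: SuperBPE context ≈ {superbpe_ctx} tokens")
```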
Nonetheless, we wanted to understand how the max context size interacts with model performance. In addition to our original BPE (ctx=4096) and SuperBPE (ctx=3000) models, we train two additional ablations: BPE (ctx=3000) and SuperBPE (ctx=4096). All models share the same 8B architecture. In our setup, the global batch size (in number of training examples) is fixed, so models with shorter context sizes take more training steps. The four model settings are summarized below, with the middle two rows being from the original paper.
| Tokenizer | Context size (tokens) | Effective context size (bytes) | Global batch size | Train steps |
|---|---|---|---|---|
| BPE | 3000 | 13,376 | 1024 | 107,982 |
| BPE | 4096 | 18,262 | 1024 | 76,543 |
| SuperBPE | 3000 | 18,268 | 1024 | 107,982 |
| SuperBPE | 4096 | 24,938 | 1024 | 76,543 |
Shown below, we find that the two models with the shorter context size in tokens (regardless of the tokenizer) perform better! (Note that even when the BPE and SuperBPE models have equivalent performance, SuperBPE remains more efficient at inference time.) While this surprised us initially, it provides a somewhat satisfying answer to the question of why SuperBPE models performed better in our paper: SuperBPE enables a more optimal tradeoff between context size and training steps, without changing the actual effective context size. This relates to some existing work about the existence of a critical batch size that strikes the optimal balance between efficiency and performance.
For model developers, this means that to obtain improvements in model performance with SuperBPE, it is important to shrink the context size (in tokens). You can preserve throughput by “rounding” the new context length to a multiple of a power of 2 and increasing the microbatch size (since all the training examples are shorter, you can fit more per device), while the effective context length in bytes is preserved. In general, this is the setting we recommend, in order to achieve gains in performance and inference-time efficiency simultaneously. However, if you are mainly interested in inference-time speedups from SuperBPE, then you can instead keep the same context size in tokens.
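Here is a minimal sketch of that adjustment. The efficiency ratio, rounding multiple, and microbatch numbers are illustrative assumptions rather than values from our training runs:

```python
def adjusted_training_config(bpe_ctx, efficiency_ratio, bpe_microbatch, multiple=256):
    """Shrink the context for SuperBPE to preserve the effective byte context,
    round to a hardware-friendly multiple, and grow the microbatch so that
    tokens per device (and hence throughput) stay roughly constant."""
    ctx = max(multiple, round(bpe_ctx / efficiency_ratio / multiple) * multiple)
    microbatch = int(bpe_microbatch * bpe_ctx / ctx)
    return ctx, microbatch

# e.g. a BPE run at 4096 tokens/example with microbatch 4, and a SuperBPE
# tokenizer that is ~1.37x more efficient on your data (illustrative numbers):
print(adjusted_training_config(4096, 1.37, 4))  # -> (3072, 5)
```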
Here, we include some notes on training the SuperBPE tokenizer itself.
Model developers may be tempted to prioritize “higher-quality” data, such as SFT or math data. While this makes sense for model training, tokenizer training is simply about learning a broadly useful vocabulary, and we recommend against skewing the tokenizer training data toward any particular domain.
In particular, training on data with templated phrases can lead to some unintended tokens. For instance, we’ve seen that when tokenizers are trained on a disproportionate amount of SFT data, canonical “AI assistant” phrases like `Sure,␣I’d␣be␣happy` or `␣glad␣I␣could␣help` become single tokens. These tokens are rare for most of pretraining, so their embeddings may become undertrained in phase 1 of pretraining and difficult to learn in phase 2.
In our original paper, we used the same tokenizer training data for learning subwords (stage 1) and superwords (stage 2), but in general they do not need to be tied. You may decide, for instance, that you want most of the tokenizer to be multilingual, but have only English superwords. (Though we have found that SuperBPE generalizes well in multilingual settings.)
It’s even possible to extend an existing tokenizer by running stage 2 directly on it. This can be useful if you don’t have access to the training data for that tokenizer (perhaps because you borrowed an off-the-shelf option)!
Stage 1 (subword) and stage 2 (superword) of SuperBPE tokenizer training differ fundamentally in the pretokenization regex, with stage 2 being a more relaxed version that allows superwords. We recommend using the most advanced regex you have for stage 1, and carrying over into stage 2 only the rules you still want enforced. For instance, in our original work, we kept the pretokenization scheme for digits in stage 2 to prevent arbitrarily long numbers from becoming a single token. You could also consider only allowing superwords that consist of sequences of complete words (see this brief discussion).
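To make the idea concrete, here is an illustrative sketch (not our exact patterns) of a GPT-2-style stage 1 regex next to a relaxed stage 2 regex that allows chunks to cross whitespace while still fencing off digit runs:

```python
import regex  # the third-party `regex` module supports \p{...} classes

# Illustrative patterns only; not the exact regexes from our tokenizers.
# Stage 1: a GPT-2-style pretokenizer. Every chunk is at most one word,
# so no merge can ever cross a word boundary.
STAGE1 = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# Stage 2: a relaxed pattern that lets chunks span whitespace and punctuation
# (enabling superwords) but still splits digits into groups of at most three,
# so arbitrarily long numbers can never become a single token.
STAGE2 = regex.compile(r"""\p{N}{1,3}|[^\p{N}]+""")

text = "The group of teens paid $1234 in 2025."
print(STAGE1.findall(text))  # word-level chunks: superwords impossible
print(STAGE2.findall(text))  # multi-word chunks; digit runs still split off
```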
In our original paper, we found that the efficiency-optimal transition point is not necessarily the best for downstream performance. Indeed, predicting the performance of a tokenizer from intrinsic features is an unsolved problem [1, 2]. Nonetheless, it is useful to think of the best transition point in terms of distance from the final desired vocab size, with 10k or 20k from the end being a reliable heuristic.
Below, we plot the indices (ranks) of the tokens used in forming superwords for our tokenizer with vocab size = 200k and transition point = 180k.
We see that the subword tokens used are all learned very early in tokenizer training, which makes sense — common sequences of words are naturally composed of common words. After index 180k, superwords are composed further into larger superwords. Thus, in general, learning useful superwords does not depend on a very large subword vocabulary.
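If you want to reproduce this kind of analysis on your own tokenizer, a rough sketch is below. It assumes the tokenizer is stored in the Hugging Face `tokenizers` JSON format with a token-to-id vocab and an ordered merge list; the file path and transition rank are placeholders:

```python
import json

# Sketch only: assumes a Hugging Face `tokenizers`-style tokenizer.json.
with open("tokenizer.json") as f:   # placeholder path
    model = json.load(f)["model"]

vocab, merges = model["vocab"], model["merges"]
TRANSITION_RANK = 180_000           # where stage 2 (superwords) begins

for pair in merges:
    left, right = pair.split(" ", 1) if isinstance(pair, str) else pair
    merged_id = vocab.get(left + right)
    if merged_id is not None and merged_id >= TRANSITION_RANK:
        # the ranks of the two pieces each superword is built from
        print(merged_id, vocab[left], vocab[right])
```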
Sometimes, evaluation scripts make assumptions about the tokenizer that are very reasonable in the case of subword tokenization, but are untrue in the case of SuperBPE. Fortunately these are easy bugs to fix, but require some attention to detail to identify. Here are some examples we’ve noticed:
Suppose that for multiple choice problems we are comparing the logprobs of the tokens `␣A`, `␣B`, `␣C`, `␣D`. However, what happens if `␣A\n` is a single token, and furthermore, in-context examples in the prompt suggest that a newline is expected after each answer choice? The result is that very little probability will be placed on `␣A`, as that probability mass is instead on `␣A\n`. To fix this, we recommend identifying the right tokens to compare, or decoding the answer option in a generative fashion.
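One way to “identify the right tokens” is to aggregate probability over every vocabulary token whose decoded text begins with the answer string. The sketch below illustrates the idea; how you obtain `token_logprobs` from your model is left as a placeholder:

```python
import math

def answer_logprob(token_logprobs, tokenizer, answer):
    # Aggregate probability over every next token consistent with `answer`,
    # e.g. " A", " A\n", or " A.", but not " About".
    total = 0.0
    for token_id, logprob in token_logprobs.items():
        decoded = tokenizer.decode([token_id])
        rest = decoded[len(answer):]
        if decoded.startswith(answer) and (rest == "" or not rest[0].isalnum()):
            total += math.exp(logprob)
    return math.log(total) if total > 0 else float("-inf")

# token_logprobs: {token_id: logprob} for the next-token distribution at the
# answer position (how you obtain it depends on your eval harness).
# best = max(" A", " B", " C", " D",
#            key=lambda a: answer_logprob(token_logprobs, tokenizer, a))
```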
In cloze-style Hellaswag evaluation, the log probabilities of multiple continuations are compared when conditioning on the same prompt. An issue arises when tokenizing the prompt and continuation together would produce a token that bridges the prompt-continuation boundary. Consider, for instance,
Prompt: One of the ping pongs lands in the cup and one of the boys begins to drink the beer. The group
Continuation: of teens is sailing down the river with others sailing in the background.
In a SuperBPE tokenizer, `␣group␣of` is usually a single token. The result is that the SuperBPE model has never seen the token `␣group` followed by the token `␣of` in training, so it learns not to predict `␣of` when conditioned on `␣group`. With our SuperBPE tokenizer, we have found that this issue affects 58% of Hellaswag prompt-continuation pairs. To fix it, we recommend tokenizing the prompt and continuation together and comparing those log probabilities instead. That is, compare the log probs of P(prompt + continuation) across candidates instead of P(continuation | prompt).
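Here is a minimal sketch of this fix, assuming a Hugging Face causal language model and tokenizer:

```python
import torch

@torch.no_grad()
def joint_score(model, tokenizer, prompt, continuation):
    # Tokenize prompt + continuation *together*, so no token is forced to end
    # at the prompt/continuation boundary (e.g. " group of" can stay one token).
    ids = tokenizer.encode(prompt + continuation, add_special_tokens=False)
    input_ids = torch.tensor([ids])
    log_probs = model(input_ids).logits.log_softmax(-1)
    # Sum the log-probability of each token given its prefix.
    targets = input_ids[:, 1:]
    return log_probs[:, :-1].gather(-1, targets.unsqueeze(-1)).sum().item()

# Because every candidate shares the same prompt, comparing these joint scores
# stands in for comparing P(continuation | prompt):
# best = max(continuations, key=lambda c: joint_score(model, tokenizer, prompt, c))
```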
Suppose that in order to prompt the model to use chain of thought, the prompt ends with “Let's think step by step.” While seemingly innocuous, this becomes an issue when the continuation starts with a common word like “The”, as “.␣The” may be a single token in a SuperBPE tokenizer. There are multiple avenues for fixing this:
- If “Let's think step by step.” is used in all of the in-context examples, it is unnecessary to include it in the final question.
- Since “step.” is very unlikely to be a single token (it is uncommon), you can back up the prompt by one character to leave off the period (see the sketch below).
- You can prevent tokens of the form “{punctuation mark}{letters}” from being learned in the first place by using a more sophisticated regex in Stage 2 (as discussed earlier).

These are all instances of the prompt boundary problem [3, 4], which plagues all tokenizers and has been extensively studied. For subword tokenizers, we can avoid this problem in languages that use whitespace by ending our prompt with a complete word and no trailing whitespace; however, this heuristic becomes unreliable when tokens can be superwords.
It turns out that all of these issues can be solved by our new paper, which presents an efficient solution to the prompt boundary problem. We are working on integrating it into `lm-evaluation-harness`, but we are not sure yet when that will roll out.
Overall, evaluation of language models is already notoriously tricky to get right, and it requires just a bit more attention to detail in the case of using a new type of tokenizer. We hope you find that it is worth it!