July 11, 2025
In the last three months, we have had the joy of working with different groups of people exploring SuperBPE in their own model development pipelines. If you missed it, here is our first post introducing SuperBPE. In the process, we’ve compiled an FAQ with new findings and some practical suggestions to help practitioners use SuperBPE to its fullest potential.
While our original paper studied fairly short-context regimes (~4K token context length), the efficiency benefits of SuperBPE become even more pronounced in the long-context regime. Specifically, for a fixed reduction factor in sequence length, the savings in inference FLOPs grow with the length of the text, because the cost of attention scales quadratically with sequence length.
To see why this is true, recall that inference FLOPs are given by 4 × D × L² + 40 × D² × L (assuming an MLP expansion of 8), where D is the hidden dimension and L is the sequence length.
The first and second terms represent the attention and non-attention FLOPs, respectively.
Now, let α denote the ratio of our tokenizer’s encoding efficiency to that of a normal BPE tokenizer (e.g., α = 1.5 for our most efficient tokenizer), so that the same text takes 1/α as many tokens. If the context length is short, then the non-attention term dominates and inference compute shrinks by a factor close to 1/α. However, in the long-context regime, the attention term dominates, and since it scales with L², the compute shrinks by a factor approaching 1/α²!
To illustrate this, we plot the inference FLOPs used by a BPE and a SuperBPE model to encode the same text, with length measured in bytes. The gap between the blue and orange lines represents the FLOPs savings. For a BPE model with a context size of 4K tokens, switching to SuperBPE gives a 35% reduction in inference FLOPs; for a BPE model with a context size of 128K tokens, that reduction grows to 50%!
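As a rough sanity check, here is a minimal sketch that plugs the formula above into code. The hidden dimension of 4096 (typical of an 8B model) and α = 1.5 are assumptions on our part, not values read off the figure:

```python
# Per-layer inference FLOPs for a transformer with MLP expansion 8:
# 4*D*L^2 for attention plus 40*D^2*L for everything else.
def inference_flops(d_model: float, seq_len: float) -> float:
    return 4 * d_model * seq_len**2 + 40 * d_model**2 * seq_len

D = 4096      # hidden dimension (assumed, roughly an 8B model)
ALPHA = 1.5   # SuperBPE encoding-efficiency ratio

for bpe_ctx in (4_096, 131_072):
    bpe_flops = inference_flops(D, bpe_ctx)
    superbpe_flops = inference_flops(D, bpe_ctx / ALPHA)  # same text, fewer tokens
    print(f"BPE ctx {bpe_ctx:>7,}: FLOPs reduction = {1 - superbpe_flops / bpe_flops:.0%}")
# -> roughly 35% at a 4K-token context and 50% at a 128K-token context
```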
Scaling laws from the Chinchilla paper established that, across a wide range of budgets, the optimal way to allocate a fixed training budget is at roughly 22 training tokens per parameter. However, these experiments used a fixed subword tokenizer. As we begin to think of the tokenizer as another variable we can control, a natural question arises: does the compute-optimal ratio shift when using more efficient tokenizers? Intuitively, we would expect that “harder” (i.e., longer on average) tokens require a more capable (i.e., larger) model to predict well.
As an initial exploration in this direction, we fixed a training budget and performed a sweep over the ratio of tokens seen during training to model parameter count. In the below figure, we plot the (smoothed) endpoint loss of the training curves. All points in the plot use the same training budget, but points on the left represent larger models trained on fewer tokens, while points on the right are smaller models trained on more tokens. The minimum on each curve represents the optimal ratio of tokens (T) to parameters (P).
We see that the optimal T/P ratio for BPE is 22 tokens/parameter as expected, but for SuperBPE it is roughly 30% lower at 15 tokens/parameter — this is suspiciously close to the 30% average reduction in tokens due to SuperBPE’s improved efficiency! This points to the possibility that compute optimality is actually a constant ratio of training bytes to parameters, not training tokens to parameters as commonly perceived. From our experiments, the true compute-optimal ratio seems to be (a very nice) 100 bytes/parameter.
For model developers intending to train compute-optimal models, this would mean making the models bigger and shrinking the number of training tokens to achieve a ratio of 15 training tokens per model parameter. In this setting, the preceding figure suggests the SuperBPE model will achieve lower BPB while still retaining a small inference-time speedup.
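For concreteness, here is a rough allocation sketch that combines the standard C ≈ 6·N·T approximation for training FLOPs with the 100 bytes/parameter heuristic. The training budget and the bytes-per-token figures are illustrative assumptions, chosen so that the resulting ratios land near the ~22 and ~15 tokens/parameter optima above:

```python
import math

# Training FLOPs C ≈ 6 * N * T, plus the hypothesis that compute optimality
# sits at ~100 training bytes per parameter regardless of tokenizer.
def optimal_allocation(compute_budget, bytes_per_token, bytes_per_param=100.0):
    tokens_per_param = bytes_per_param / bytes_per_token
    n_params = math.sqrt(compute_budget / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

C = 1e23  # example training budget in FLOPs (placeholder)
for name, bpt in [("BPE", 4.5), ("SuperBPE", 6.7)]:  # assumed bytes/token
    n, t = optimal_allocation(C, bpt)
    print(f"{name:>8}: ~{n / 1e9:.0f}B params, ~{t / 1e9:.0f}B tokens "
          f"({t / n:.1f} tokens/param)")
# The more efficient tokenizer calls for a bigger model trained on fewer tokens.
```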
In our original experiments, we adjusted the max context size of SuperBPE models (in tokens) to match the effective max context size of the BPE model in raw text (bytes). This is because we wanted to avoid an unfair advantage from SuperBPE seeing more textual context for the same next-token prediction. In new analysis, we support this design choice by showing that the longer the context in bytes (not tokens), the easier the next token is to predict. The following two plots show the average loss at every token index (left) vs byte index (right) — when measured at fixed token indices, SuperBPE has an advantage from seeing more context (achieving lower loss on average at the same token index), whereas at fixed byte indices, this advantage goes away.
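If you want to apply the same matching in your own setup, a minimal sketch looks like the following. The tokenizer paths and text sample are placeholders; the bytes-per-token efficiency is measured empirically on your own data:

```python
from transformers import AutoTokenizer

def bytes_per_token(tokenizer, texts):
    """Empirical encoding efficiency on a sample of your pretraining data."""
    n_bytes = sum(len(t.encode("utf-8")) for t in texts)
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    return n_bytes / n_tokens

# Placeholder paths and sample: substitute your own tokenizers and documents.
bpe = AutoTokenizer.from_pretrained("path/to/bpe-tokenizer")
superbpe = AutoTokenizer.from_pretrained("path/to/superbpe-tokenizer")
sample = ["..."]  # a representative list of raw documents

bpe_ctx = 4096  # the BPE model's max context size in tokens
effective_bytes = bpe_ctx * bytes_per_token(bpe, sample)
superbpe_ctx = round(effective_bytes / bytes_per_token(superbpe, sample))
print(f"Context matched in bytes: SuperBPE context ≈ {superbpe_ctx} tokens")
```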
Nonetheless, we wanted to understand how the max context size interacts with model performance. In addition to our original BPE (ctx=4096) and SuperBPE (ctx=3000) models, we train two additional ablations: BPE (ctx=3000) and SuperBPE (ctx=4096). All models share the same 8B architecture. In our setup, the global batch size (in number of training examples) is fixed, so models with shorter context sizes take more training steps. The four model settings are summarized below, with the middle two rows being from the original paper.
| Tokenizer | Context size (tokens) | Effective context size (bytes) | Global batch size | Train steps |
|---|---|---|---|---|
| BPE | 3000 | 13,376 | 1024 | 107,982 |
| BPE | 4096 | 18,262 | 1024 | 76,543 |
| SuperBPE | 3000 | 18,268 | 1024 | 107,982 |
| SuperBPE | 4096 | 24,938 | 1024 | 76,543 |
Shown below, we find that the two models with the shorter context size in tokens (regardless of the tokenizer) perform better! (Note that even when the BPE and SuperBPE models have equivalent performance, SuperBPE remains more efficient at inference time.) While this surprised us initially, it provides a somewhat satisfying answer to the question of why SuperBPE models performed better in our paper: SuperBPE enables a more optimal tradeoff between context size and training steps, without changing the actual effective context size. This relates to some existing work about the existence of a critical batch size that strikes the optimal balance between efficiency and performance.
For model developers, this means that to obtain improvements in model performance with SuperBPE, it is important to shrink the context size (in tokens). You can preserve throughput by “rounding” the new context length to a multiple of a power of 2 and increasing the microbatch size (since all the training examples are shorter, you can fit more per device), while the effective context length in bytes is preserved. In general, this is the setting we recommend, in order to achieve gains in performance and inference-time efficiency simultaneously. However, if you are mainly interested in inference-time speedups from SuperBPE, then you can instead keep the same context size in tokens.
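Here is a minimal sketch of that adjustment. The efficiency ratio, rounding multiple, and microbatch numbers are illustrative assumptions rather than values from our training runs:

```python
def adjusted_training_config(bpe_ctx, efficiency_ratio, bpe_microbatch, multiple=256):
    """Shrink the context for SuperBPE to preserve the effective byte context,
    round to a hardware-friendly multiple, and grow the microbatch so that
    tokens per device (and hence throughput) stay roughly constant."""
    ctx = max(multiple, round(bpe_ctx / efficiency_ratio / multiple) * multiple)
    microbatch = int(bpe_microbatch * bpe_ctx / ctx)
    return ctx, microbatch

# e.g. a BPE run at 4096 tokens/example with microbatch 4, and a SuperBPE
# tokenizer that is ~1.37x more efficient on your data (illustrative numbers):
print(adjusted_training_config(4096, 1.37, 4))  # -> (3072, 5)
```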
Here, we include some notes on training the SuperBPE tokenizer itself.
Model developers may be tempted to prioritize “higher-quality” data, such as SFT or math data. While this makes sense for model training, tokenizer training is simply about learning a broadly useful vocabulary, and we recommend against skewing the tokenizer training data toward any particular domain.
In particular, training on data with templated phrases can lead to some unintended tokens. For instance, we’ve seen that when tokenizers are trained on a disproportionate amount of SFT data, canonical “AI assistant” phrases like `Sure,␣I’d␣be␣happy` or `␣glad␣I␣could␣help` become single tokens. These tokens are rare for most of pretraining, so their embeddings may become undertrained in phase 1 of pretraining and difficult to learn in phase 2.
In our original paper, we used the same tokenizer training data for learning subwords (stage 1) and superwords (stage 2), but in general they do not need to be tied. You may decide, for instance, that you want most of the tokenizer to be multilingual, but have only English superwords. (Though we have found that SuperBPE generalizes well in multilingual settings.)
It’s even possible to extend an existing tokenizer by running stage 2 directly on it. This can be useful if you don’t have access to the training data for that tokenizer (perhaps because you borrowed an off-the-shelf option)!
Stage 1 (subword) and stage 2 (superword) of SuperBPE tokenizer training differ fundamentally in the pretokenization regex, with stage 2 being a more relaxed version that allows superwords. We recommend using the most advanced regex you have for stage 1, and carrying over into stage 2 only the rules you still want enforced. For instance, in our original work, we kept the pretokenization scheme for digits in stage 2 to prevent arbitrarily long numbers from becoming a single token. You could also consider only allowing superwords that consist of sequences of complete words (see this brief discussion).
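To make the idea concrete, here is an illustrative sketch (not our exact patterns) of a GPT-2-style stage 1 regex next to a relaxed stage 2 regex that allows chunks to cross whitespace while still fencing off digit runs:

```python
import regex  # the third-party `regex` module supports \p{...} classes

# Illustrative patterns only; not the exact regexes from our tokenizers.
# Stage 1: a GPT-2-style pretokenizer. Every chunk is at most one word,
# so no merge can ever cross a word boundary.
STAGE1 = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# Stage 2: a relaxed pattern that lets chunks span whitespace and punctuation
# (enabling superwords) but still splits digits into groups of at most three,
# so arbitrarily long numbers can never become a single token.
STAGE2 = regex.compile(r"""\p{N}{1,3}|[^\p{N}]+""")

text = "The group of teens paid $1234 in 2025."
print(STAGE1.findall(text))  # word-level chunks: superwords impossible
print(STAGE2.findall(text))  # multi-word chunks; digit runs still split off
```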
In our original paper, we found that the efficiency-optimal transition point is not necessarily the best for downstream performance. Indeed, predicting the performance of a tokenizer from intrinsic features is an unsolved problem [1, 2]. Nonetheless, it is useful to think of the best transition point in terms of distance from the final desired vocab size, with 10k or 20k from the end being a reliable heuristic.
Below, we plot the indices (ranks) of the tokens used in forming superwords for our tokenizer with vocab size = 200k and transition point = 180k.
We see that the subword tokens used are all learned very early in tokenizer training, which makes sense — common sequences of words are naturally composed of common words. After index 180k, superwords are composed further into larger superwords. Thus, in general, learning useful superwords does not depend on a very large subword vocabulary.
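If you want to reproduce this kind of analysis on your own tokenizer, a rough sketch is below. It assumes the tokenizer is stored in the Hugging Face `tokenizers` JSON format with a token-to-id vocab and an ordered merge list; the file path and transition rank are placeholders:

```python
import json

# Sketch only: assumes a Hugging Face `tokenizers`-style tokenizer.json.
with open("tokenizer.json") as f:   # placeholder path
    model = json.load(f)["model"]

vocab, merges = model["vocab"], model["merges"]
TRANSITION_RANK = 180_000           # where stage 2 (superwords) begins

for pair in merges:
    left, right = pair.split(" ", 1) if isinstance(pair, str) else pair
    merged_id = vocab.get(left + right)
    if merged_id is not None and merged_id >= TRANSITION_RANK:
        # the ranks of the two pieces each superword is built from
        print(merged_id, vocab[left], vocab[right])
```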
Sometimes, evaluation scripts make assumptions about the tokenizer that are very reasonable in the case of subword tokenization, but are untrue in the case of SuperBPE. Fortunately these are easy bugs to fix, but require some attention to detail to identify. Here are some examples we’ve noticed:
Suppose that for multiple choice problems we are comparing the logprobs of the tokens `␣A`, `␣B`, `␣C`, `␣D`. However, what happens if `␣A\n` is a single token, and furthermore, in-context examples in the prompt suggest that a newline is expected after each answer choice? The result is that very little probability will be placed on `␣A`, as that probability mass is instead on `␣A\n`. To fix this, we recommend identifying the right tokens to compare, or decoding the answer option in a generative fashion.
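One way to “identify the right tokens” is to aggregate probability over every vocabulary token whose decoded text begins with the answer string. The sketch below illustrates the idea; how you obtain `token_logprobs` from your model is left as a placeholder:

```python
import math

def answer_logprob(token_logprobs, tokenizer, answer):
    # Aggregate probability over every next token consistent with `answer`,
    # e.g. " A", " A\n", or " A.", but not " About".
    total = 0.0
    for token_id, logprob in token_logprobs.items():
        decoded = tokenizer.decode([token_id])
        rest = decoded[len(answer):]
        if decoded.startswith(answer) and (rest == "" or not rest[0].isalnum()):
            total += math.exp(logprob)
    return math.log(total) if total > 0 else float("-inf")

# token_logprobs: {token_id: logprob} for the next-token distribution at the
# answer position (how you obtain it depends on your eval harness).
# best = max(" A", " B", " C", " D",
#            key=lambda a: answer_logprob(token_logprobs, tokenizer, a))
```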
In cloze-style Hellaswag evaluation, the log probabilities of multiple continuations are compared when conditioning on the same prompt. An issue arises when tokenizing the prompt and continuation together would produce a token that bridges the prompt-continuation boundary. Consider, for instance,
Prompt: One of the ping pongs lands in the cup and one of the boys begins to drink the beer. The group
Continuation: of teens is sailing down the river with others sailing in the background.
In a SuperBPE tokenizer, `␣group␣of` is usually a single token. The result is that the SuperBPE model has never seen the token `␣group` followed by the token `␣of` in training, so it learns not to predict `␣of` when conditioned on `␣group`. With our SuperBPE tokenizer, we have found that this issue affects 58% of Hellaswag prompt-continuation pairs. To fix it, we recommend tokenizing the prompt and continuation together and comparing those log probabilities instead. That is, compare the log probs of P(prompt + continuation) across candidates instead of P(continuation | prompt).
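Here is a minimal sketch of this fix, assuming a Hugging Face causal language model and tokenizer:

```python
import torch

@torch.no_grad()
def joint_score(model, tokenizer, prompt, continuation):
    # Tokenize prompt + continuation *together*, so no token is forced to end
    # at the prompt/continuation boundary (e.g. " group of" can stay one token).
    ids = tokenizer.encode(prompt + continuation, add_special_tokens=False)
    input_ids = torch.tensor([ids])
    log_probs = model(input_ids).logits.log_softmax(-1)
    # Sum the log-probability of each token given its prefix.
    targets = input_ids[:, 1:]
    return log_probs[:, :-1].gather(-1, targets.unsqueeze(-1)).sum().item()

# Because every candidate shares the same prompt, comparing these joint scores
# stands in for comparing P(continuation | prompt):
# best = max(continuations, key=lambda c: joint_score(model, tokenizer, prompt, c))
```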
Suppose that in order to prompt the model to use chain of thought, the prompt ends with “Let's think step by step.” While seemingly innocuous, this becomes an issue when the continuation starts with a common word like “The”, as “.␣The” may be a single token in a SuperBPE tokenizer. There are multiple avenues for fixing this:
- If “Let's think step by step.” is used in all of the in-context examples, it is unnecessary to include it in the final question.
- Since “step.” is very unlikely to be a single token (it is uncommon), you can back up the prompt by one character to leave off the period (see the sketch below).
- You can prevent tokens of the form “{punctuation mark}{letters}” from being learned in the first place by using a more sophisticated regex in Stage 2 (as discussed earlier).

These are all instances of the prompt boundary problem [3, 4], which plagues all tokenizers and has been extensively studied. For subword tokenizers, we can avoid this problem in languages that use whitespace by ending our prompt with a complete word and no trailing whitespace; however, this heuristic becomes unreliable when tokens can be superwords.
It turns out that all of these issues can be solved by our new paper, which presents an efficient solution to the prompt boundary problem. We are working on integrating it into `lm-evaluation-harness`, but we are not sure yet when that will roll out.
Overall, evaluation of language models is already notoriously tricky to get right, and it requires just a bit more attention to detail in the case of using a new type of tokenizer. We hope you find that it is worth it!