Tokenization gauge symmetry in language modeling
ORAL
Abstract
Large language models (LLMs) use tokenizers to generate deterministic, canonical encodings of text into discrete tokens. However, one can generate several non-canonical encodings of a text using the same tokenizer. This introduces a gauge freedom in language modeling, as multiple tokenizations represent redundant descriptions of the same text. Training language models with only canonical encodings breaks this tokenization gauge symmetry. This symmetry breaking is a fundamental vulnerability of LLMs, with consequences ranging from silly mistakes (e.g. counting two r's in 'strawberry') to potentially serious pitfalls (e.g. bypassing safety guardrails). We quantify the resulting symmetry breaking using changes in cross-entropy loss under alternate tokenizations. Across several open-source LLMs, the dominant effect is an effective rescaling of the context-length dependence of the cross-entropy, governed by a temperature-dependent entropy rate. Consequently, the fractional change in loss follows a simple scaling law dependent on the relative compression ratio of token encodings. Notably, we find that increasing the sampling temperature approximately restores the broken gauge symmetry. Our work takes steps toward developing a rigorous understanding of tokenization gauge symmetry in LLMs, which is essential for developing safe and reliable AI systems.
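To make the redundancy concrete, here is a minimal sketch that enumerates every encoding of a word under a toy vocabulary. The vocabulary, the function names, the use of greedy longest-match as a stand-in for the "canonical" encoding, and the token-count ratio used as a "relative compression ratio" are all illustrative assumptions, not the authors' tokenizer or definitions.

```python
from functools import lru_cache

# Hypothetical toy vocabulary; real tokenizers (e.g. BPE) use far larger ones.
VOCAB = frozenset({"s", "t", "r", "a", "w", "b", "e", "y",
                   "st", "er", "raw", "straw", "berry"})


def all_encodings(text, vocab=VOCAB):
    """Return every segmentation of `text` into tokens drawn from `vocab`."""

    @lru_cache(maxsize=None)
    def split(i):
        # All ways to segment text[i:], each as a tuple of tokens.
        if i == len(text):
            return [()]
        out = []
        for j in range(i + 1, len(text) + 1):
            piece = text[i:j]
            if piece in vocab:
                out.extend((piece,) + rest for rest in split(j))
        return out

    return [list(enc) for enc in split(0)]


def greedy_encoding(text, vocab=VOCAB):
    """Longest-match-first segmentation (assumed stand-in for 'canonical')."""
    i, out = 0, []
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token in vocab matches at position {i}")
    return out


def relative_compression(canonical, alternate):
    """Token-count ratio of an alternate encoding to the canonical one
    (one possible definition, assumed here for illustration)."""
    return len(alternate) / len(canonical)


if __name__ == "__main__":
    word = "strawberry"
    canonical = greedy_encoding(word)
    for enc in sorted(all_encodings(word), key=len):
        print(relative_compression(canonical, enc), enc)
```

Every encoding printed decodes to the same string, which is the sense in which the alternatives are redundant descriptions of one text. A model trained only on the canonical encoding has no reason to treat them equivalently.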
Presenters
- Kanishk Jain, Emory University