Evolution of Language Statistics under Renormalization Group Flow
ORAL
Abstract
The first step in language modelling is compressing a corpus into multi-character strings called tokens. Byte Pair Encoding (BPE), the most common tokenization algorithm, learns tokens by iteratively merging the most frequent bigram xy into a new token z. This recursive process produces a large vocabulary that encodes the n-gram statistics of the corpus. BPE thus both depends on and strongly shapes the statistical properties of the tokenized language. To shed light on this process, we present a novel formalism that describes the evolution of bigram statistics during BPE. The approach treats each BPE merge as a renormalization step, in which the count of the most frequent bigram serves as a cutoff. First, we derive the exact update rules for how the bigram counts change at each BPE step. We then apply a Markovian approximation to rewrite these update rules in terms of bigram and unigram counts only. These rules are used to construct an interaction kernel that describes the complete change of the bigram distribution due to a BPE step. The formalism yields a non-linear flow equation that describes how the statistical structure of a language's bigrams renormalizes as the BPE cutoff is lowered. We explore the solutions of this equation in search of a fixed point and ask what it can tell us about real languages.
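The merge procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' formalism or code: it recounts bigrams by brute force at every step rather than using the exact update rules or the Markovian approximation. The function names (`bigram_counts`, `merge_pair`, `bpe_flow`), the character-level starting vocabulary, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (bigrams) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))


def merge_pair(tokens, pair):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    merged = pair[0] + pair[1]
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def bpe_flow(text, num_merges):
    """Run up to `num_merges` BPE steps on `text`, starting from characters.

    Returns the final token sequence and a history of (merged pair, count)
    entries. The count of the merged pair plays the role of the cutoff
    mentioned in the abstract.
    """
    tokens = list(text)  # assumption: start from single characters
    history = []
    for _ in range(num_merges):
        counts = bigram_counts(tokens)
        if not counts:
            break
        # Assumption: ties are broken by the lexicographically largest pair.
        pair, cutoff = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        history.append((pair, cutoff))
        tokens = merge_pair(tokens, pair)
    return tokens, history


if __name__ == "__main__":
    tokens, history = bpe_flow("abababcab", 2)
    print(tokens)
    print(history)
```

Recounting all bigrams after every merge is wasteful but keeps the sketch short; the recorded `history` makes it easy to watch how the top bigram count changes from one merge to the next on a toy string.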
Presenters
Roberto E Avalos (Emory University)