Semantic Chunking and the Entropy of Natural Language

POSTER

Abstract

In 1951, Shannon estimated the entropy rate of English to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This estimate implies that English contains nearly 80% redundancy relative to random text. Despite its significance, no theoretical explanation for this striking observation has ever been established. Here, we propose a theoretical framework to explain this entropy rate from first principles. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-token level, reflecting the organization of meaningful information at different levels of abstraction. The correlations within the text can then be decomposed hierarchically, allowing for analytical treatment. Our theory reproduces Shannon’s classic estimate as a special case. To test our theory, we used LLMs to measure the entropy rate across open datasets of English corpora from diverse genres, ranging from children’s books to modern poetry. Surprisingly, we find that the entropy rate of language is not fixed but increases systematically with textual complexity, which is captured by the only free parameter in our model.
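As a rough illustration of the arithmetic behind the redundancy figure and of how an LLM-based measurement can be expressed in bits per character, here is a minimal sketch. It assumes a 27-symbol alphabet (26 letters plus space), which the abstract does not state, and it assumes the model's per-token log-probabilities are available in nats. The function names are hypothetical and do not describe the authors' actual pipeline.

```python
import math


def redundancy(entropy_rate_bits, alphabet_size=27):
    """Redundancy relative to uniformly random text over the alphabet.

    alphabet_size=27 (26 letters plus space) is an assumption made for
    illustration; the abstract does not specify the alphabet.
    """
    max_entropy_bits = math.log2(alphabet_size)
    return 1.0 - entropy_rate_bits / max_entropy_bits


def bits_per_character(token_logprobs_nats, num_characters):
    """Convert per-token log-probabilities (in nats) to bits per character.

    token_logprobs_nats: log p(token | context) for each token of a text,
    as a language model might report them (hypothetical input).
    num_characters: length of the same text in characters.
    """
    total_nats = -sum(token_logprobs_nats)
    total_bits = total_nats / math.log(2)
    return total_bits / num_characters


if __name__ == "__main__":
    # An entropy rate of one bit per character gives roughly 79% redundancy.
    print(f"redundancy at 1 bit/char: {redundancy(1.0):.3f}")
```

Under these assumptions, an entropy rate of one bit per character against a maximum of log2(27), about 4.75 bits, gives a redundancy just under 80%, consistent with the figure quoted above.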

*The authors acknowledge funding from the Eric and Wendy Schmidt Fund, the Simons Foundation, and the NSF ACCESS Allocation CIS240836.

Presenters

  • Weishun Zhong

    • Institute for Advanced Study

Authors

  • Weishun Zhong

    • Institute for Advanced Study