Mutual Information Scaling Law for Long-Context Language Modeling

Zhuo Chen; Oriol Mayné i Comas; Zhuotao Jin; Di Luo; Marin Soljacic

Oral-In-person

Abstract

We demonstrate that bipartite mutual information in natural language exhibits sub-volume-law scaling, a power-law growth that contrasts with the logarithmic scaling observed in critical systems. This power-law growth reveals that multi-token correlations cannot be decomposed into two-point interactions, necessitating a many-body treatment. We derive a universal bound relating the long-context capability of large language models to the dimension of their history state, that is, the latent variables that store past information. Just as entanglement scaling laws determine which tensor network ansätze can efficiently represent quantum many-body states, our bound establishes which neural architectures can capture the observed information scaling in sequential data. This yields a fundamental condition: effective sequence modeling requires the history state dimension to grow as a power law with sequence length. Transformer architectures naturally satisfy this condition through linearly growing key-value caches, while state-space models with fixed recurrent states require increasing model size. Our framework establishes information-theoretic limits for capturing long-range dependencies, providing concrete targets for efficient architecture design beyond the quadratic-complexity/fixed-state dichotomy.
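
The condition stated in the abstract can be illustrated with a small numerical sketch. It is not the authors' code: the exponent `beta`, the prefactor `c`, and the model sizes below are hypothetical placeholders chosen only to show the qualitative contrast between a key-value cache that grows linearly with sequence length and a recurrent state of fixed size. The cache-size formula assumes standard multi-head attention, with one key and one value vector of width `d_model` per layer per token.

```python
def required_state_dim(seq_len: int, beta: float, c: float = 1.0) -> float:
    """Illustrative power-law target c * L**beta for the history state dimension.

    beta and c are hypothetical; they are not values reported in the abstract.
    """
    return c * seq_len ** beta


def kv_cache_dim(seq_len: int, n_layers: int, d_model: int) -> int:
    """Scalars stored in a transformer key-value cache (assumed layout):
    one key and one value vector of width d_model per layer per token."""
    return 2 * n_layers * d_model * seq_len


def fixed_state_dim(n_layers: int, d_state: int) -> int:
    """Scalars stored in a fixed-size recurrent state, independent of length."""
    return n_layers * d_state


def max_supported_length(state_dim: float, beta: float, c: float = 1.0) -> float:
    """Largest L for which c * L**beta stays within a given state dimension."""
    return (state_dim / c) ** (1.0 / beta)


if __name__ == "__main__":
    beta = 0.5  # hypothetical exponent, for illustration only
    for seq_len in (1_000, 10_000, 100_000):
        need = required_state_dim(seq_len, beta)
        kv = kv_cache_dim(seq_len, n_layers=2, d_model=8)
        fixed = fixed_state_dim(n_layers=2, d_state=16)
        print(f"L={seq_len}: target={need:.1f}, kv_cache={kv}, fixed_state={fixed}")
```

Under these toy numbers the linearly growing cache stays above the power-law target at every length, whereas the fixed state falls below it once the sequence is long enough, so the only way to keep up is to enlarge the model. This is the qualitative point the abstract makes about state-space models.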

March 17, 2026, 10:12 AM – March 17, 2026, 10:24 AM

Publication: https://arxiv.org/abs/2503.04725
https://neurips.cc/virtual/2025/poster/115721

Presenters

Zhuo Chen
- Massachusetts Institute of Technology

Authors

Zhuo Chen
- Massachusetts Institute of Technology
Oriol Mayné i Comas
Zhuotao Jin
Di Luo
Marin Soljacic