Mutual Information Scaling Law for Long-Context Language Modeling

ORAL

Abstract

We demonstrate that bipartite mutual information in natural language exhibits sub-volume law scaling, a power-law growth that contrasts with the logarithmic scaling observed in critical systems. This growth reveals that multi-token correlations cannot be decomposed into two-point interactions, necessitating a many-body treatment. We derive a universal bound relating the long-context capability of large language models to the dimension of their history state, the latent variables that store past information. Just as entanglement scaling laws determine which tensor network ansätze can efficiently represent quantum many-body states, our bound establishes which neural architectures can capture the observed information scaling in sequential data. This yields a fundamental condition: effective sequence modeling requires the history state dimension to grow as a power law with sequence length. Transformer architectures naturally satisfy this condition through linearly growing key-value caches, while state-space models with fixed recurrent states require increasing model size. Our framework establishes information-theoretic limits for capturing long-range dependencies, providing concrete targets for efficient architecture design beyond the quadratic-complexity/fixed-state dichotomy.
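As a rough illustration of the condition stated in the abstract, the Python sketch below compares how the history state grows with sequence length for a key-value cache and for a fixed recurrent state. The exponent `beta`, the prefactor `c`, the function names, and the model sizes are all assumptions chosen for illustration; none of them are values or definitions taken from the paper.

```python
def required_state_dim(seq_len: int, beta: float, c: float = 1.0) -> float:
    """Illustrative target: history state dimension growing as c * L**beta.

    beta and c are hypothetical; the abstract only states that the growth
    is a power law in the sequence length L.
    """
    return c * seq_len ** beta


def kv_cache_dim(seq_len: int, n_layers: int, d_model: int) -> int:
    """History state of a transformer: one key and one value vector of size
    d_model per layer for every past token, so it grows linearly in L."""
    return 2 * n_layers * d_model * seq_len


def fixed_state_dim(n_layers: int, d_state: int) -> int:
    """History state of a fixed-state recurrent model: independent of L."""
    return n_layers * d_state


def max_supported_length(state_dim: float, beta: float, c: float = 1.0) -> float:
    """Largest L for which c * L**beta still fits within state_dim."""
    return (state_dim / c) ** (1.0 / beta)


if __name__ == "__main__":
    beta = 0.5  # assumed sub-volume exponent (0 < beta < 1), for illustration
    for seq_len in (1_000, 100_000, 10_000_000):
        need = required_state_dim(seq_len, beta)
        kv = kv_cache_dim(seq_len, n_layers=2, d_model=64)
        fixed = fixed_state_dim(n_layers=2, d_state=1024)
        print(seq_len, need, kv >= need, fixed >= need)
```

Under these assumed numbers, the linearly growing cache always stays above a sub-linear power-law target, whereas a fixed state meets it only up to the length returned by `max_supported_length`; beyond that length the state, and hence the model, would have to be enlarged.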

*National Science Foundation (PHY-2019786, OAC-2320345, #2138259, #2138286, #2138307, #2137603, #2138296), U.S. Air Force Research Laboratory (FA8750-19-2-1000), Department of the Air Force Artificial Intelligence Accelerator, MathWorks Fellowship, and the National Artificial Intelligence Research Resource Pilot (NAIRR250043)

Publication: https://arxiv.org/abs/2503.04725
https://neurips.cc/virtual/2025/poster/115721

Presenters

  • Zhuo Chen

    • Massachusetts Institute of Technology

Authors

  • Zhuo Chen

    • Massachusetts Institute of Technology
  • Oriol Mayné i Comas

    • Massachusetts Institute of Technology
  • Zhuotao Jin

    • Massachusetts Institute of Technology
    • Harvard University
  • Di Luo

    • University of California, Los Angeles
  • Marin Soljacic

    • Massachusetts Institute of Technology