Is Grokking a Computational Glass Relaxation?

Entao Yang; Xiaotian Zhang; Yue Shang; Ge Zhang

Oral-In-person

Abstract

Grokking, the surprising phenomenon in which neural networks (NNs) suddenly generalize long after achieving (almost) perfect training accuracy, challenges the conventional understanding of NN generalization. Here we extend the previously discovered high-entropy advantage in neural network generalizability, in which high-entropy NN states typically correlate with better generalization, and investigate grokking through the lens of Boltzmann entropy. We find that, much as physical systems can be quenched too rapidly, the gradient-based optimizers in grokking systems can effectively 'cool' the NN parameters too quickly and thereby trap the NNs in non-generalizable minima. This enables us to map grokking onto a computational glass relaxation process and to situate many previous findings on grokking within this framework. We further identify a distinct high-entropy advantage under grokking, which is notably more pronounced than the analog reported in prior work. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer, WanD, based on Wang-Landau molecular dynamics, which can eliminate grokking without imposing any constraints and can find high-norm generalizing solutions.
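For readers unfamiliar with the Wang-Landau idea mentioned in the abstract, the following is a minimal sketch of Wang-Landau sampling on a toy system. It is not the authors' WanD optimizer and involves no neural network: the "system" is assumed to be 10 independent bits whose energy is the number of set bits, chosen because its exact density of states is the binomial coefficient. The function name `wang_landau_bits` and all parameter values are hypothetical choices for illustration. The sketch estimates ln g(E), the Boltzmann entropy up to an additive constant and in units of k_B, by penalizing energies that have already been visited.

```python
import math
import random


def wang_landau_bits(n_bits=10, ln_f_final=1e-3, flatness=0.8,
                     seed=0, max_steps=2_000_000):
    """Estimate ln g(E) for a toy system of n_bits independent bits.

    Energy E is the number of bits set to 1, so the exact density of
    states is g(E) = C(n_bits, E). All names and defaults here are
    illustrative assumptions, not taken from the abstract.
    """
    rng = random.Random(seed)
    bits = [0] * n_bits
    energy = 0
    ln_g = [0.0] * (n_bits + 1)   # running estimate of ln g(E)
    hist = [0] * (n_bits + 1)     # visit histogram for the current stage
    ln_f = 1.0                    # modification factor, halved each stage
    steps = 0

    while ln_f > ln_f_final and steps < max_steps:
        # Propose flipping one randomly chosen bit (a symmetric move).
        i = rng.randrange(n_bits)
        new_energy = energy + (1 - 2 * bits[i])

        # Accept with probability min(1, g(E_old) / g(E_new)).
        delta = ln_g[energy] - ln_g[new_energy]
        if delta >= 0 or rng.random() < math.exp(delta):
            bits[i] ^= 1
            energy = new_energy

        # Penalize the current energy so the walk is pushed elsewhere.
        ln_g[energy] += ln_f
        hist[energy] += 1
        steps += 1

        # When the histogram is roughly flat, refine the modification factor.
        if steps % 1000 == 0 and min(hist) > flatness * sum(hist) / len(hist):
            hist = [0] * (n_bits + 1)
            ln_f /= 2.0

    # Fix the arbitrary additive constant so that ln g(0) = 0.
    base = ln_g[0]
    return [value - base for value in ln_g]


if __name__ == "__main__":
    estimate = wang_landau_bits()
    for e, value in enumerate(estimate):
        exact = math.log(math.comb(10, e))
        print(f"E={e:2d}  estimated ln g={value:6.3f}  exact ln g={exact:6.3f}")
```

In this toy setting the estimated entropy curve can be checked directly against the exact binomial result; the flat-histogram walk visits rare low-entropy energies as often as common high-entropy ones, which is the property a Wang-Landau-style method relies on.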

March 18, 2026, 1:24 PM – March 18, 2026, 1:36 PM

Publications:
[1] Yang, E., Zhang, X., Shang, Y., & Zhang, G. (2025). High-entropy Advantage in Neural Networks' Generalizability. arXiv:2503.13145. In revision.
[2] Zhang, X., Shang, Y., Yang, E., & Zhang, G. (2025). Is Grokking a Computational Glass Relaxation? arXiv:2505.11411. Advances in Neural Information Processing Systems, 2025. (Accepted as Spotlight.)

Presenters

Entao Yang
- Air Liquide USA

Authors

Entao Yang
- Air Liquide USA
Xiaotian Zhang
Yue Shang
- University of Pennsylvania
Ge Zhang
- City University of Hong Kong