Is Grokking a Computational Glass Relaxation?
ORAL
Abstract
Grokking, the surprising phenomenon in which neural networks (NNs) suddenly generalize long after achieving (almost) perfect training accuracy, challenges the conventional understanding of NN generalization. Here we extend the previously discovered high-entropy advantage in NN generalizability, in which high-entropy NN states typically correlate with better generalization, and investigate grokking through the lens of Boltzmann entropy. We find that the gradient-based optimizers in grokking systems can effectively 'cool' the NN parameters too quickly, analogous to a physical system that is quenched too rapidly, and thereby trap NNs in non-generalizable minima. This enables us to map grokking onto a computational glass relaxation process and to situate many previous findings on grokking within this framework. We further identify a distinct high-entropy advantage under grokking, which is notably more pronounced than the analogous effect reported in prior work. Inspired by grokking's far-from-equilibrium nature, we develop a toy optimizer, WanD, based on Wang-Landau molecular dynamics, which can eliminate grokking without imposing any constraints and can find high-norm generalizing solutions.
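For readers unfamiliar with the Wang-Landau idea mentioned above, the following is a minimal sketch of plain Wang-Landau Monte Carlo sampling on a toy system. It is not the WanD optimizer and not Wang-Landau molecular dynamics; the toy system (independent two-state spins whose "energy" is the number of up spins), the function name `wang_landau_log_dos`, and all parameter values are assumptions chosen purely for illustration. The sketch estimates the log density of states ln g(E), which equals the Boltzmann entropy S(E) in units where k_B = 1.

```python
import math
import random


def wang_landau_log_dos(n_spins=8, flatness=0.8, ln_f_final=1e-3, seed=0):
    """Estimate ln g(E) for a toy system with Wang-Landau sampling.

    Toy system (illustrative assumption): n_spins two-state spins,
    with energy E defined as the number of spins in state 1.
    The exact answer is g(E) = C(n_spins, E), which makes checking easy.
    """
    rng = random.Random(seed)
    state = [0] * n_spins
    energy = 0
    log_g = [0.0] * (n_spins + 1)   # running estimate of ln g(E)
    hist = [0] * (n_spins + 1)      # visit histogram for the flatness check
    ln_f = 1.0                      # modification factor, halved each stage

    while ln_f > ln_f_final:
        for _ in range(10000):
            i = rng.randrange(n_spins)
            new_energy = energy + (1 - 2 * state[i])
            # Accept the flip with probability min(1, g(E_old) / g(E_new)).
            # 1 - random() lies in (0, 1], so the logarithm is always defined.
            if math.log(1.0 - rng.random()) < log_g[energy] - log_g[new_energy]:
                state[i] ^= 1
                energy = new_energy
            # Penalize the current energy level so the walk is pushed
            # toward levels it has visited less often.
            log_g[energy] += ln_f
            hist[energy] += 1
        # Once the histogram is roughly flat, refine the modification factor.
        if min(hist) > flatness * (sum(hist) / len(hist)):
            hist = [0] * len(hist)
            ln_f /= 2.0

    # ln g(E) is only determined up to a constant; fix it with g(0) = 1.
    base = log_g[0]
    return [v - base for v in log_g]


if __name__ == "__main__":
    estimate = wang_landau_log_dos()
    for e, s in enumerate(estimate):
        exact = math.log(math.comb(8, e))
        print(f"E={e}  estimated ln g={s:.3f}  exact ln g={exact:.3f}")
```

Because the sampler weights each state by 1/g(E), it visits all energy levels about equally often instead of settling into low-energy regions. This flat-histogram behavior is the general property that makes Wang-Landau-style methods of interest for far-from-equilibrium problems.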
Publications:
[1] Yang, E., Zhang, X., Shang, Y., & Zhang, G. (2025). High-entropy Advantage in Neural Networks' Generalizability. arXiv:2503.13145. In revision.
[2] Zhang, X., Shang, Y., Yang, E., & Zhang, G. (2025). Is Grokking a Computational Glass Relaxation? arXiv:2505.11411. Advances in Neural Information Processing Systems, 2025. Accepted as Spotlight.
Presenters
- Entao Yang, Air Liquide USA