The Inevitable Catastrophe of Neuronal Polysemanticity
ORAL
Abstract
Polysemantic neurons, which respond to multiple, often unrelated features, are empirically observed in most deep neural networks, including state-of-the-art language and vision models. This polysemanticity is hypothesized to arise from feature superposition, where networks exploit feature sparsity to learn far more features than they have dimensions. This phenomenon is a central challenge in mechanistic interpretability, obscuring models’ internal representations and computations.
We draw on tools from statistical mechanics, nonlinear dynamics, and catastrophe theory to study polysemanticity in toy models. Despite their simplicity, these models exhibit rich phenomenology that we analyze theoretically and validate empirically, revealing connections to contemporary interpretability techniques such as dictionary learning and sparse autoencoders. Our work develops a first-principles account of polysemanticity and feature superposition, and identifies architectural, statistical and learning strategies to mitigate or deliberately exploit this phenomenon.
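As a rough illustration of the superposition idea described above, the following minimal sketch packs more feature directions than dimensions into a hidden space and checks that each feature can still be read back when only one is active at a time (the extreme-sparsity case). This is not the toy model studied in this work: the random unit-norm directions, the tied-weight readout, the function name `superposition_sketch`, and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np


def superposition_sketch(n_features=256, n_dims=64, seed=0):
    """Hypothetical illustration: store n_features > n_dims features as random directions."""
    rng = np.random.default_rng(seed)

    # Assumption: each feature gets a random unit-norm direction (column of W).
    W = rng.normal(size=(n_dims, n_features))
    W /= np.linalg.norm(W, axis=0, keepdims=True)

    # Gram matrix of feature directions: diagonal is ~1, off-diagonal
    # entries are the pairwise interference between distinct features.
    gram = W.T @ W
    max_interference = np.abs(gram - np.eye(n_features)).max()

    # Extreme sparsity: exactly one feature active per input (one-hot).
    # The hidden code of feature j is column j of W, and the tied-weight
    # readout scores for that input are column j of the Gram matrix.
    recovered = gram.argmax(axis=0)

    # Polysemanticity proxy: how many features load on each hidden unit
    # with above-typical weight (threshold 1/sqrt(n_dims) is an assumption).
    features_per_neuron = (np.abs(W) > 1.0 / np.sqrt(n_dims)).sum(axis=1)

    return max_interference, recovered, features_per_neuron


if __name__ == "__main__":
    interference, recovered, per_neuron = superposition_sketch()
    print("max pairwise interference:", interference)
    print("all one-hot features recovered:",
          bool((recovered == np.arange(recovered.size)).all()))
    print("min features per hidden unit:", per_neuron.min())
```

In this sketch every hidden unit carries weight from many features, so each unit is polysemantic in the sense used above, yet one-hot inputs remain identifiable because distinct unit vectors always have cosine similarity below one. Dense inputs would not enjoy this guarantee, which is why sparsity matters for superposition.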
*This work was supported by the Laboratory Directed Research & Development (LDRD) program. Computational resources were provided by the National Energy Research Scientific Computing Center (NERSC). Work at the Molecular Foundry was supported by the Office of Science, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Presenters
- Omar A Ashour
  - Lawrence Berkeley National Laboratory
  - University of California, Berkeley