The Inevitable Catastrophe of Neuronal Polysemanticity

Omar Ashour; Sinéad Griffin

Oral-In-person

Abstract

Polysemantic neurons, which respond to multiple, often unrelated features, are empirically observed in most deep neural networks, including state-of-the-art language and vision models. This polysemanticity is hypothesized to arise from feature superposition, where networks exploit feature sparsity to learn far more features than they have dimensions. This phenomenon is a central challenge in mechanistic interpretability, obscuring models’ internal representations and computations.

We draw on tools from statistical mechanics, nonlinear dynamics, and catastrophe theory to study polysemanticity in toy models. Despite their simplicity, these models exhibit rich phenomenology that we analyze theoretically and validate empirically, revealing connections to contemporary interpretability techniques such as dictionary learning and sparse autoencoders. Our work develops a first-principles account of polysemanticity and feature superposition, and identifies architectural, statistical, and learning strategies to mitigate or deliberately exploit this phenomenon.
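
As a rough illustration of the superposition idea described above, the following sketch packs five features into two hidden dimensions. It is not the authors' model: the weights are hand-picked (unit vectors at the corners of a regular pentagon) rather than learned, and the names `W`, `BIAS`, `encode`, `decode`, `active_features`, and `features_driving_neuron` are all hypothetical. It only aims to show that a ReLU readout can recover a single active feature, that each hidden neuron then responds to several features, and that the readout can fail once the input is no longer sparse.

```python
import math

N_FEATURES = 5  # number of features (assumed for illustration)
N_HIDDEN = 2    # hidden dimensions, deliberately fewer than features

# Hypothetical hand-picked weights: feature i is stored along a unit
# vector at angle 2*pi*i/5, so the five directions form a pentagon.
W = [
    (math.cos(2 * math.pi * i / N_FEATURES), math.sin(2 * math.pi * i / N_FEATURES))
    for i in range(N_FEATURES)
]

# A negative bias chosen to cancel the overlap between neighboring directions.
BIAS = -math.cos(2 * math.pi / N_FEATURES)


def encode(x):
    """Project a length-5 feature vector into the 2-d hidden space."""
    return tuple(
        sum(x[i] * W[i][k] for i in range(N_FEATURES)) for k in range(N_HIDDEN)
    )


def decode(h):
    """Read every feature back out as relu(w_j . h + bias)."""
    return [
        max(0.0, W[j][0] * h[0] + W[j][1] * h[1] + BIAS) for j in range(N_FEATURES)
    ]


def active_features(x, tol=1e-9):
    """Indices of features the readout reports as active for input x."""
    return [j for j, y in enumerate(decode(encode(x))) if y > tol]


def features_driving_neuron(k, tol=1e-9):
    """Indices of features with a nonzero weight onto hidden neuron k."""
    return [i for i in range(N_FEATURES) if abs(W[i][k]) > tol]


if __name__ == "__main__":
    for i in range(N_FEATURES):
        one_hot = [1.0 if j == i else 0.0 for j in range(N_FEATURES)]
        print("input feature", i, "-> readout", active_features(one_hot))
    print("neuron 0 responds to features", features_driving_neuron(0))
    print("dense input [0, 2] -> readout", active_features([1.0, 0.0, 1.0, 0.0, 0.0]))
```

In this sketch, hidden neuron 0 carries a nonzero weight from every one of the five features, which is the sense in which it is polysemantic. The recovery relies on sparsity: with only one feature active, the bias suppresses interference from the other directions, whereas activating features 0 and 2 together produces a readout that does not match the input.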

March 19, 2026, 12:36 PM – March 19, 2026, 12:48 PM

Presenters

Omar Ashour
- Lawrence Berkeley National Laboratory

Authors

Omar Ashour
- Lawrence Berkeley National Laboratory
Sinéad Griffin
- Lawrence Berkeley National Laboratory