Statistical Mechanics of Semantic Compression

ORAL · Invited

Abstract

I consider the problem of semantic compression: how short can you make a message while keeping its meaning? This question can be quantified by introducing a measure of semantic similarity. The idea of a continuous semantic vector space has gained traction in experimental cognitive psychology as well as in machine learning and artificial intelligence. In such a space, input data such as words, text, or pictures are mapped to high-dimensional vectors, and the relationships among these "semantic embeddings" encode their meaning. This suggests that a natural metric for semantic similarity is simply the Euclidean distance in the semantic vector space. Equipped with this metric, I formulate a combinatorial optimization problem for determining the minimal length of a message that satisfies a bound on the semantic distance, or distortion. This optimization problem can be mapped to a statistical mechanical model, a two-spin Hopfield spin glass, which can be solved using replica theory. In this work, I map out the phase diagram of this model in an idealized setting in which the elements of a finite but large lexicon all have random embeddings. I conclude by commenting on extensions and implications of these results for semantic compression with more realistic embeddings, such as those encountered in machine learning.
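
To make the setup concrete, the following is a minimal sketch of the combinatorial problem in the idealized random-embedding setting. The additive encoding of a message as the sum of its word embeddings, the parameter values, and the function names are illustrative assumptions for this sketch, not necessarily the precise formulation used in the talk.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 12, 32  # lexicon size and embedding dimension (tiny, so brute force is feasible)
    lexicon = rng.standard_normal((N, d)) / np.sqrt(d)  # random embeddings (idealized setting)
    target = rng.standard_normal(d)                     # embedding of the meaning to convey

    def distortion(words):
        """Squared Euclidean distance between a message's embedding and the target,
        assuming (for illustration) that a message embeds as the sum of its words'
        embeddings."""
        message = lexicon[list(words)].sum(axis=0)
        return float(np.sum((message - target) ** 2))

    def minimal_length(eps):
        """Shortest message (number of words) whose distortion is at most eps,
        found by exhaustive search over word subsets of increasing size."""
        for length in range(N + 1):
            subsets = itertools.combinations(range(N), length)
            if any(distortion(s) <= eps for s in subsets):
                return length
        return None  # no subset of this lexicon meets the distortion bound

    print(minimal_length(eps=10.0))

Expanding the squared distance ||sum_i s_i x_i - y||^2 in binary occupation variables s_i yields pairwise couplings J_ij = x_i . x_j and fields h_i = x_i . y, which indicates how the problem can be recast as a two-spin Hopfield-type Hamiltonian; in the large-lexicon limit the exhaustive search above is replaced by the replica-theory analysis described in the abstract.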

* I acknowledge the support of the Eric and Wendy Schmidt Membership in Biology, the Simons Foundation, and the Starr Foundation Member Fund in Biology at the Institute for Advanced Study.

Presenters

  • Tankut U Can

    Institute for Advanced Study

Authors

  • Tankut U Can

    Institute for Advanced Study