Scale-dependent relationships in human language

POSTER

Abstract

Many models of human language extract statistical regularities of natural language to estimate “meaning”. Mutual information between words in natural language has been shown to decay as a power law (Lin and Tegmark, 2017). Despite this evidence for scale-invariant statistics, statistical models of language typically impose a strong scale. We study the scale-dependence of language using Word2Vec (Mikolov et al., 2013), a shallow neural network model that generates a vector embedding of words by training over a corpus of text. We modify the Word2Vec algorithm to choose neighbors of a target word with an exponentially decaying distribution, and examine the embeddings generated across a broad spectrum of scale parameters. Different syntactic and semantic relations (as classified in several tests developed by Mikolov et al.) appear to be best expressed at different word scales. Word similarities between neighbors capture qualitatively different behavior across a range of word scales, often peaking at distances that would not be captured by an embedding sampled at any single scale. These results point toward the importance of developing scale-free models of semantic meaning.
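A minimal sketch of the modification described above is given below: instead of the fixed context window used in standard skip-gram Word2Vec, context words are drawn at offsets sampled from an exponentially decaying distribution over word distance. The function and parameter names (sample_context_pairs, scale, max_offset) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_context_pairs(tokens, scale, max_offset=50, rng=None):
    """Generate (target, context) training pairs for a skip-gram-style model,
    sampling context offsets with an exponentially decaying distribution.

    `scale` is the (assumed) decay parameter: larger values give more weight
    to distant neighbors of the target word.
    """
    rng = rng or np.random.default_rng()
    offsets = np.arange(1, max_offset + 1)
    # Exponentially decaying probability over word distance, normalized.
    probs = np.exp(-offsets / scale)
    probs /= probs.sum()

    pairs = []
    for i, target in enumerate(tokens):
        # Draw a signed offset: distance from the decaying distribution,
        # direction (before/after the target) chosen uniformly.
        distance = rng.choice(offsets, p=probs)
        direction = rng.choice([-1, 1])
        j = i + direction * distance
        if 0 <= j < len(tokens):
            pairs.append((target, tokens[j]))
    return pairs

# Example: a small scale emphasizes immediate (syntactic) neighbors,
# while a larger scale samples more distant, semantically related words.
corpus = "the quick brown fox jumps over the lazy dog".split()
print(sample_context_pairs(corpus, scale=2.0))
```

The resulting (target, context) pairs could then be fed to a standard skip-gram training loop, with the scale parameter swept to produce one embedding per scale.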

Presenters

  • Aakash Sarkar

    Psychological and Brain Sciences, Physics, Boston University

Authors

  • Aakash Sarkar

    Psychological and Brain Sciences, Physics, Boston University

  • Marc Howard

    Psychological and Brain Sciences, Physics, Boston University