Understanding attention in the mean-field

ORAL

Abstract

Attention is the vehicle driving forward progress in modern deep learning, and understanding how it works is a central question in machine learning. One avenue to tackle this problem is to understand how the geometry of the token vectors changes as they propagate through successive layers of the network. We analyze the resulting depth recursion for the token geometry by means of a replica-symmetric ansatz, which provides a starting point for perturbation theory. Using simple derivations, we reproduce known results such as the necessity of initializing near a trivial model and the importance of pre-layer norm.
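To make the kind of recursion described above concrete, here is a minimal numerical sketch (not the authors' method or code) that tracks a simple measure of token geometry, the token-token Gram matrix, as tokens pass through a stack of randomly initialized single-head attention layers with pre-layer norm and residual connections. All dimensions, weight scales, and function names below are illustrative assumptions.

```python
# Toy illustration (assumption: not the paper's derivation) of how token geometry,
# summarized by the Gram matrix of token vectors, evolves with depth in a randomly
# initialized attention-only network with pre-layer norm.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, depth = 16, 64, 20   # illustrative sizes

def layer_norm(x, eps=1e-6):
    # Normalize each token vector; applied before attention (pre-layer norm).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def attention_layer(x):
    # Single-head self-attention with random Gaussian weights at initialization.
    Wq, Wk, Wv = (rng.normal(0.0, 1.0 / np.sqrt(d_model), (d_model, d_model))
                  for _ in range(3))
    h = layer_norm(x)                        # pre-LN: normalize before attention
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ v                   # residual connection

x = rng.normal(size=(n_tokens, d_model))
for layer in range(depth):
    x = attention_layer(x)
    gram = x @ x.T / d_model                 # token-token overlap matrix
    off_diag = (gram.sum() - np.trace(gram)) / (n_tokens * (n_tokens - 1))
    print(f"layer {layer:2d}: mean off-diagonal overlap = {off_diag:.3f}")
```

The printed mean off-diagonal overlap is one example of the geometric quantity whose layer-by-layer recursion the abstract refers to; the mean-field/replica-symmetric analysis in the paper studies such recursions analytically rather than by simulation.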

* Stanford Graduate Fellowship, National Science Foundation (grant No. 211199).

Presenters

  • Aditya Cowsik

    Stanford University

Authors

  • Aditya Cowsik

    Stanford University

  • Surya Ganguli

    Stanford University

  • Tamra Nebabu

    Stanford University

  • Xiao-Liang Qi

    Stanford University