Understanding attention in the mean-field
ORAL
Abstract
Attention is the mechanism driving progress in modern deep learning, and understanding how it works is a central question in machine learning. One avenue for tackling this problem is to understand how the geometry of the token vectors changes as they propagate deeper into the network. We analyze the resulting layer-to-layer recursion by means of a replica-symmetric ansatz, which provides a starting point for perturbation theory. Using simple derivations, we reproduce known facts such as the necessity of initializing near a trivial model and the importance of pre-layer norm.
* Stanford Graduate Fellowship, National Science Foundation (grant No. 211199).
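As a rough numerical illustration of the abstract's setup, and not the paper's actual derivation, the sketch below tracks how the geometry of token vectors (summarized by their mean pairwise cosine overlap) evolves as they pass through randomly initialized single-head attention blocks with pre-layer norm and residual connections. All names (`attention_block`, `mean_token_overlap`) and parameter choices are hypothetical and for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean and unit variance across features.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def attention_block(x, rng, d_model, sigma_w=1.0):
    # One randomly initialized single-head attention layer with a pre-LN residual
    # branch (no output projection or MLP, to keep the sketch minimal).
    scale = sigma_w / np.sqrt(d_model)
    Wq = rng.normal(0.0, scale, (d_model, d_model))
    Wk = rng.normal(0.0, scale, (d_model, d_model))
    Wv = rng.normal(0.0, scale, (d_model, d_model))

    h = layer_norm(x)                        # pre-layer norm
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    logits = q @ k.T / np.sqrt(d_model)
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return x + attn @ v                      # residual connection

def mean_token_overlap(x):
    # Average cosine similarity between distinct token vectors: a one-number
    # summary of how much the token geometry has collapsed toward one direction.
    u = x / np.linalg.norm(x, axis=-1, keepdims=True)
    g = u @ u.T
    n = g.shape[0]
    return (g.sum() - n) / (n * (n - 1))

rng = np.random.default_rng(0)
n_tokens, d_model, depth = 32, 128, 20
x = rng.normal(size=(n_tokens, d_model))

for layer in range(depth):
    x = attention_block(x, rng, d_model)
    print(f"layer {layer + 1:2d}: mean token overlap = {mean_token_overlap(x):.3f}")
```

Depending on the weight scale, the printed overlap typically drifts with depth, which is the kind of layer-to-layer change in token geometry that the mean-field recursion in the abstract is meant to capture analytically.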
Presenters
-
Aditya Cowsik
Stanford University
Authors
-
Aditya Cowsik
Stanford University
-
Surya Ganguli
Stanford University
-
Tamra Nebabu
Stanford University
-
Xiao-Liang Qi
Stanford University