Bounds on learning with power-law priors
ORAL
Abstract
Modern machine-learning architectures often achieve good generalization despite having enough parameters to express any function on the training data. This is surprising, since such flexibility suggests they should "overfit" and generalize poorly. To generalize well in the regime where any function can be expressed, a learning machine must have a good "inductive bias": although any function may be expressed, some must be strongly disfavored. We study the inductive biases of many expressive classifiers through the distribution of functions produced by random parameter values, a proxy for their induced Bayesian priors and the corresponding inductive bias. These experiments reveal a universal power-law, "Zipfian" prior in the space of functions. Here we rationalize the universality of this prior by studying the implications of power-law tails in the prior for Bayesian learning in the overparameterized regime. We show that any tail broader than Zipfian implies that a learning machine will fail to generalize on unseen data, while a narrower tail limits the number of functions that can be learned. This implies that the type of prior distribution seen in commonly used learning machines is the only type of prior that allows successful learning in the overparameterized regime.
* This work was supported in part by NSF, NIH, and the Simons Foundation
Presenters
-
Sean A Ridout
Emory University
Authors
-
Sean A Ridout
Emory University
-
Ilya M Nemenman
Emory University
-
Ard A Louis
University of Oxford
-
Chris Mingard
University of Oxford
-
Radosław Grabarczyk
University of Oxford
-
Kamaludin Dingle
Gulf University for Science & Technology
-
Guillermo Valle Pérez
University of Oxford
-
Charles London
University of Oxford