Bounds on learning with power-law priors

ORAL

Abstract

Modern machine-learning architectures often achieve good generalization despite having enough parameters to express any function on the training data. This is surprising, since such flexibility suggests they should "overfit" and generalize poorly. To generalize well in the regime where any function can be expressed, a learning machine must have a good "inductive bias": although any function may be expressed, some must be strongly disfavored. We study the inductive biases of many expressive classifiers through the distribution of functions produced by random parameter values, a proxy for their induced Bayesian priors and the corresponding inductive bias. These experiments reveal a universal power-law, "Zipfian" prior in the space of functions. Here we rationalize the universality of this prior by studying the implications of power-law tails in the prior for Bayesian learning in the overparameterized regime. We show that any tail broader than Zipfian implies that a learning machine will fail to generalize on unseen data, while a narrower tail limits the number of functions that can be learned. This implies that the type of prior distribution seen in commonly used learning machines is the only type of prior that can allow successful learning in the overparameterized regime.

* This work was supported in part by NSF, NIH, and the Simons Foundation
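The abstract describes estimating a network's induced prior over functions by sampling random parameter values and recording which function each sample realizes on a fixed set of inputs. The following is a minimal sketch of that procedure, not the authors' actual experiments: the architecture (a small tanh MLP), the Gaussian parameter distribution, and all hyperparameters are illustrative assumptions. For a Zipfian prior, the resulting rank-frequency curve would fall off roughly as 1/rank.

```python
import numpy as np

def random_network_function(inputs, rng, hidden=10):
    # Boolean function induced on the fixed inputs by one random draw
    # of the parameters of a small one-hidden-layer tanh network.
    # (Architecture and Gaussian prior are illustrative assumptions.)
    n_in = inputs.shape[1]
    W1 = rng.normal(size=(n_in, hidden))
    b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=hidden)
    h = np.tanh(inputs @ W1 + b1)
    return tuple((h @ W2 > 0).astype(int))  # hashable label vector

def empirical_prior(n_samples=20000, n_points=7, n_in=3, seed=0):
    # Estimate P(f) by repeated random-parameter sampling: count how
    # often each distinct function (label vector) appears, then return
    # its rank-ordered frequency distribution.
    rng = np.random.default_rng(seed)
    inputs = rng.normal(size=(n_points, n_in))
    counts = {}
    for _ in range(n_samples):
        f = random_network_function(inputs, rng)
        counts[f] = counts.get(f, 0) + 1
    freqs = np.sort(np.array(list(counts.values())))[::-1]
    return freqs / n_samples

probs = empirical_prior()
# probs[r] is the estimated prior mass of the rank-(r+1) most likely
# function; plotting probs vs. rank on log-log axes reveals the tail.
```

In the paper's setting the key question is the exponent of this tail: a tail broader than 1/rank spreads too much prior mass over incompatible functions, while a narrower tail concentrates on too few.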

Presenters

  • Sean A Ridout

    Emory University

Authors

  • Sean A Ridout

    Emory University

  • Ilya M Nemenman

    Emory University

  • Ard A Louis

    University of Oxford

  • Chris Mingard

    University of Oxford

  • Radosław Grabarczyk

    University of Oxford

  • Kamaludin Dingle

    Gulf University for Science & Technology

  • Guillermo Valle Pérez

    University of Oxford

  • Charles London

    University of Oxford