The Information Bottleneck Theory of Deep Neural Networks
COFFEE_KLATCH · Invited
Abstract
Multilayered deep feedforward neural networks (DNNs), trained by Stochastic Gradient Descent (SGD), perform amazingly well on multiple supervised learning tasks. Understanding why and how is still a major scientific challenge. In this line of work, we show that large-scale layered networks, when trained with SGD, achieve, layer by layer, the Information Bottleneck optimal universal (architecture-independent) tradeoff between sample complexity and accuracy, for problems that are successively refinable in the information-theoretic sense. In this sense, DNNs are provably optimal universal learning machines. Moreover, this optimality is achieved through stochastic relaxation, via the noisy gradients, to locally Gibbs distributions on the weights of the network. The theory provides many new predictions: an interpretation of the hidden layers; equivalent architectures for a given task; mechanisms for self-organized hierarchical representations; exactly solvable models and relations to symmetries and invariants; mechanisms for transfer learning; new biologically plausible learning principles; and more. In this talk I will describe some of these predictions and relate them to Bill Bialek's insights and scientific achievements.
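For reference, the Information Bottleneck tradeoff invoked above is the variational principle of Tishby, Pereira, and Bialek (1999). A minimal statement, writing X for the input, Y for the label, and T for a layer's compressed representation of X (this notation is assumed here for illustration):

\[
\min_{p(t\mid x)} \; I(X;T) \;-\; \beta\, I(T;Y),
\]

where the Lagrange multiplier \beta > 0 sets the balance between compressing the input (sample complexity) and preserving information about the label (accuracy).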
Authors
Naftali Tishby
Hebrew University of Jerusalem, Israel