Looking into the black box: probing internal activations in a data-driven weather model reveals interpretable physical features

ORAL

Abstract

Large data-driven physics models such as DeepMind’s GraphCast have empirically succeeded in parameterizing time-evolution operators for complex dynamical systems, with accuracy that matches or, in some cases, exceeds that of classical physics-based solvers. However, how these data-driven models perform their computations is largely unknown, and whether their internal representations correspond to anything interpretable or physically plausible remains an open question. In this work, we combine tools from interpretability research on large language models with sparsity-promoting methods from dynamical systems to analyze intermediate computational layers in the weather model GraphCast. In particular, we use sparse autoencoders and dictionary learning to discover preferred directions (features) in the neuron space of the model. We uncover distinct features across a wide range of length and time scales, including features corresponding to tropical cyclones, atmospheric rivers, diurnal behavior, large-scale precipitation patterns, and specific geographical encodings, among others. We further demonstrate that precise interventions on these internal features lead to sparse and interpretable changes in model outputs, opening the possibility of explaining model predictions in a human-understandable manner or even uncovering unknown physical mechanisms with causal equivalences. As a case study, we sparsely modify internal features in GraphCast to alter the strength of evolving hurricanes.
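The two ingredients named above, a sparse autoencoder that learns a dictionary of feature directions and an intervention that rescales one feature, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and does not touch GraphCast: the activations are synthetic toy data, the L1-penalized ReLU autoencoder is one common formulation assumed here for illustration, and every name (`make_toy_activations`, `init_sae`, `train`, `scale_feature`) and hyperparameter is hypothetical.

```python
import numpy as np

def make_toy_activations(n=256, d_model=8, n_true=16, seed=1):
    """Synthetic stand-in for model activations: sparse mixes of unit directions."""
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(n_true, d_model))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    active = rng.uniform(size=(n, n_true)) < 0.15
    codes = rng.uniform(size=(n, n_true)) * active
    return codes @ D

def init_sae(d_model, n_features, seed=0):
    rng = np.random.default_rng(seed)
    W_dec = rng.normal(size=(n_features, d_model))
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    return {"W_enc": W_dec.T.copy(), "b_enc": np.zeros(n_features),
            "W_dec": W_dec, "b_dec": np.zeros(d_model)}

def encode(p, x):
    # Non-negative feature activations.
    return np.maximum(0.0, (x - p["b_dec"]) @ p["W_enc"] + p["b_enc"])

def decode(p, f):
    return f @ p["W_dec"] + p["b_dec"]

def loss_and_grads(p, x, l1):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    n = x.shape[0]
    xc = x - p["b_dec"]
    pre = xc @ p["W_enc"] + p["b_enc"]
    f = np.maximum(0.0, pre)
    err = decode(p, f) - x
    loss = (err ** 2).sum() / n + l1 * f.sum() / n  # f >= 0, so |f| == f
    d_xhat = 2.0 * err / n
    d_pre = (d_xhat @ p["W_dec"].T + l1 / n) * (pre > 0)
    grads = {
        "W_dec": f.T @ d_xhat,
        "W_enc": xc.T @ d_pre,
        "b_enc": d_pre.sum(axis=0),
        # b_dec appears both in the decoder output and in the centered input.
        "b_dec": d_xhat.sum(axis=0) - (d_pre @ p["W_enc"].T).sum(axis=0),
    }
    return loss, grads

def train(x, n_features, l1=0.05, lr=0.01, steps=200, seed=0):
    """Plain full-batch gradient descent; returns parameters and loss history."""
    p = init_sae(x.shape[1], n_features, seed)
    losses = []
    for _ in range(steps):
        loss, g = loss_and_grads(p, x, l1)
        losses.append(loss)
        for name in p:
            p[name] -= lr * g[name]
    return p, losses

def scale_feature(p, x, k, scale):
    """Rescale feature k in activation space, leaving everything else untouched.

    Only the component of x explained by feature k is changed, so the edit
    moves each sample along the single decoder direction W_dec[k].
    """
    f = encode(p, x)
    return x + (scale - 1.0) * f[:, [k]] * p["W_dec"][k]
```

In this sketch, `scale_feature(p, x, k, 0.0)` ablates feature `k` and a scale above 1 amplifies it; in a real model the edited activations would be passed on through the remaining layers to observe the effect on the output.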

*T.M. acknowledges financial support from a Stanford Graduate Fellowship and from the National Science Foundation Graduate Research Fellowship Program.

Presenters

  • Theodore MacMillan

    • Stanford University

Authors

  • Theodore MacMillan

    • Stanford University
  • Nicholas T Ouellette

    • Stanford University