Avoiding a reproducibility crisis in deep learning for surrogate potentials: How massively parallel programming, millions of training steps, and numerics combine to create non-determinism in models, and what this means for the simulated physics

ORAL

Abstract

Deep learning has recently provided ground-breaking results across scientific areas, including for molecular dynamics simulations. For many researchers, the ability to train an interatomic potential on first-principles data and then perform molecular dynamics with quantum-mechanical accuracy at the speed of empirical models represents a fantasy realized. Deep learning frameworks such as TensorFlow and PyTorch rely heavily on massively parallel speedups from graphics processing units (GPUs); data-distributed parallel training across multiple GPUs is also common. Recently, however, it has been discovered that repeated training runs using identical algorithms, hyperparameters, and input data, even with all random seeds fixed, can produce significantly different models. This non-determinism arises from floating-point non-associativity coupled with atomic operations whose execution order varies between runs, compounded over millions to billions of iterative calculations. Non-reproducibility of model training has already interfered with approvals for safety-critical autonomous-vehicle software and medical diagnostics, and similar repercussions are expected for scientific applications. Here we study the effects of non-determinism on the models produced for deep neural network potentials for molecular dynamics simulations, including how this model variability translates into variability in the description of the physical phase space and its associated observables, and we offer tips and best practices for achieving correctness and reproducibility within this new paradigm.
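
As a concrete illustration of the mechanism named in the abstract, the short Python sketch below (our own hedged example, not taken from the abstract or its associated publications) shows that floating-point addition is not associative, and that summing the same addends in a different order, as happens when many parallel threads accumulate into a shared value with atomic adds, can yield different results.

    import numpy as np

    # (1) Floating-point addition is not associative.
    a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- the 1.0 is lost to rounding before a is added

    # (2) Summing identical addends in a different order changes the result.
    # Parallel atomic adds (e.g., gradient accumulation on a GPU) impose no
    # fixed ordering, so each run effectively performs a different permutation.
    rng = np.random.default_rng(0)
    vals = rng.standard_normal(1_000_000).astype(np.float32)
    s1 = vals.sum(dtype=np.float32)
    s2 = rng.permutation(vals).sum(dtype=np.float32)
    print(s1, s2, s1 == s2)  # the sums typically differ in the last digits

Frameworks expose partial mitigations: in PyTorch, for example, combining torch.manual_seed(...) with torch.use_deterministic_algorithms(True) (and, on CUDA, setting the CUBLAS_WORKSPACE_CONFIG environment variable) requests deterministic kernels where they exist, usually at some cost in speed, and raises an error for operations that have no deterministic implementation rather than running them silently.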

* This work is supported by Oak Ridge National Laboratory under the Laboratory Directed Research and Development Program (LDRD projects 11288 and 11506).

Publication: Coletti M, Sedova A, Chahal R, Gibson L, Roy S, Bryantsev V. Multiobjective Hyperparameter Optimization for Deep Learning Interatomic Potential Training Using NSGA-II. In: Proceedings of the 52nd International Conference on Parallel Processing Workshops; 2023 Aug 7. pp. 172-179.

Planned work: Understanding numerical reproducibility in training and application of deep learning surrogate potentials for physics

Presenters

  • Ada Sedova

    Oak Ridge National Laboratory

Authors

  • Ada Sedova

    Oak Ridge National Laboratory

  • Ganesh Sivaraman

    Argonne National Laboratory

  • Mark Coletti

    Oak Ridge National Laboratory

  • Wael Elwasif

    Oak Ridge National Laboratory

  • Micholas D Smith

    University of Tennessee, Knoxville

  • Oscar Hernandez

    Oak Ridge National Laboratory