Model-free quantification of completeness, uncertainties, and outliers in atomistic machine learning using information theory

ORAL

Abstract

Quantifying information content is needed for several problems in atomistic machine learning (ML), from training set curation and uncertainty quantification (UQ) to obtaining insights from large datasets or trajectories. However, atomistic ML often requires unsupervised learning or model predictions to quantify information in simulation or training data. Here, we introduce a theoretical strategy leading to a model-free approach to quantifying information content in atomistic datasets. We show that the information entropy of atom-centered representations explains common heuristics in atomistic ML, from learning curves to generalization errors. Our method also introduces a UQ strategy that quantifies epistemic uncertainty and detects out-of-distribution samples without the need for a model. These results have been used to explain error trends in datasets for ML potentials, detect rare events in simulations, and benchmark the reliability of interatomic potentials. This work provides a new tool for data-driven atomistic simulation, combining efforts in ML, simulations, and theory.
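As a rough illustration of what an information entropy over atom-centered representations can look like in practice, the sketch below estimates the entropy of a set of descriptor vectors with a kernel density estimate, and scores a new environment by its overlap with that set. It is a minimal sketch under stated assumptions, not necessarily the implementation used in this work: the Gaussian kernel, the `bandwidth` parameter, and the function names `kernel_entropy` and `novelty` are all hypothetical choices made for the example, and the rows of `X` stand in for whatever atom-centered representation is used.

```python
import numpy as np


def kernel_entropy(X, bandwidth=1.0):
    """Kernel-density estimate of the information entropy (in nats) of the rows of X.

    Assumption: a Gaussian kernel with a single, user-chosen bandwidth.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances between descriptor vectors.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2.0 * bandwidth**2))
    # H = -(1/n) * sum_i log[(1/n) * sum_j K(x_i, x_j)]
    return float(-np.mean(np.log(K.mean(axis=1))))


def novelty(y, X, bandwidth=1.0):
    """Negative log of the summed kernel overlap between y and the rows of X.

    Values above zero mean y overlaps with less than one "effective" data
    point, which can serve as a model-free out-of-distribution flag.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    a = -np.sum((X - y) ** 2, axis=-1) / (2.0 * bandwidth**2)
    # Log-sum-exp computed by hand so distant points do not underflow to log(0).
    m = a.max()
    return float(-(m + np.log(np.sum(np.exp(a - m)))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 8))       # placeholder descriptors, not real data
    print("entropy (nats):", kernel_entropy(X))
    print("novelty of a training point:", novelty(X[0], X))
    print("novelty of a distant point:", novelty(np.full(8, 10.0), X))
```

Under these assumptions, a dataset of identical environments has zero entropy, a dataset of n mutually non-overlapping environments has entropy log n, and neither quantity requires training a model.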

*This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory (LLNL) under Contract DE-AC52-07NA27344. The authors acknowledge funding from the Laboratory Directed Research and Development (LDRD) Program at LLNL under project tracking codes 22-ERD-055 and 23-SI-006.

Publication: https://doi.org/10.48550/arXiv.2404.12367

Presenters

  • Daniel Schwalbe-Koda

    • UCLA

Authors

  • Daniel Schwalbe-Koda

    • UCLA
  • Sebastien Hamel

    • Lawrence Livermore National Laboratory
  • Babak Sadigh

    • Lawrence Livermore National Laboratory
  • Fei Zhou

    • Lawrence Livermore National Laboratory
  • Vincenzo Lordi

    • Lawrence Livermore National Laboratory