Model-free quantification of completeness, uncertainties, and outliers in atomistic machine learning using information theory
ORAL
Abstract
Quantifying information content is needed for several problems in atomistic machine learning (ML), from training set curation and uncertainty quantification (UQ) to obtaining insights from large datasets or trajectories. However, atomistic ML often requires unsupervised learning or model predictions to quantify information in simulation or training data. Here, we introduce a theoretical strategy leading to a model-free approach to quantifying information content in atomistic datasets. We show that the information entropy of atom-centered representations explains common heuristics in atomistic ML, from learning curves to generalization errors. Our method also introduces a UQ strategy to quantify epistemic uncertainty and detect out-of-distribution samples without the need for a model. These results have been used to explain error trends in datasets for ML potentials, detect rare events in simulations, and benchmark the reliability of interatomic potentials. This work provides a new tool for data-driven atomistic simulation that combines efforts in ML, simulations, and theory.
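As a rough illustration of what a model-free entropy of a descriptor set can look like, the sketch below estimates the information entropy of a dataset with a Gaussian kernel density estimate and scores new samples by how little kernel overlap they have with that dataset. The abstract does not specify these formulas, so everything here is an assumption: the function names (`kernel_matrix`, `dataset_entropy`, `novelty`), the Gaussian kernel, the bandwidth `h`, and the use of plain NumPy arrays in place of real atom-centered representations are all hypothetical choices made for illustration.

```python
import numpy as np


def kernel_matrix(x, y, h):
    """Gaussian kernel between every row of x and every row of y (assumed kernel)."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * h**2))


def dataset_entropy(x, h=1.0):
    """Kernel density estimate of the entropy (in nats) of descriptors x.

    H = -(1/n) * sum_i log[(1/n) * sum_j K(x_i, x_j)]
    The diagonal K(x_i, x_i) = 1 keeps the argument of the log positive.
    """
    n = len(x)
    k = kernel_matrix(x, x, h)
    return -np.mean(np.log(k.sum(axis=1) / n))


def novelty(y, x, h=1.0):
    """Score each row of y against the reference set x: -log sum_i K(y, x_i).

    Values above zero mean the sample overlaps with less than one
    reference point's worth of kernel weight, a possible outlier signal.
    """
    k = kernel_matrix(y, x, h)
    overlap = np.maximum(k.sum(axis=1), np.finfo(float).tiny)  # avoid log(0)
    return -np.log(overlap)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(200, 8))      # stand-in for descriptors
    far_away = reference[:5] + 50.0            # shifted far from the data
    print("entropy (nats):", dataset_entropy(reference))
    print("novelty in-set:", novelty(reference[:5], reference))
    print("novelty shifted:", novelty(far_away, reference))
```

Under these assumptions, a set of identical descriptors has zero entropy, and a set of n descriptors that do not overlap at all has entropy log(n), which is the sense in which such a quantity can track dataset diversity without training a model.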
*This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory (LLNL) under Contract DE-AC52-07NA27344. The authors acknowledge funding from the Laboratory Directed Research and Development (LDRD) Program at LLNL under project tracking codes 22-ERD-055 and 23-SI-006.
Publication: https://doi.org/10.48550/arXiv.2404.12367
Presenters
- Daniel Schwalbe-Koda, UCLA