On the Effect of Training Data on Machine Learning Phonon Dispersion

Jaesuk Park; Feliciano Giustino

Oral-In-person

Abstract

Recent advances in machine learning interatomic potential (MLIP) architectures have improved both the accuracy and the scalability of energy and force predictions in chemical systems for many practical applications. Less attention has been paid to the data and the pipeline used to train MLIP models, specifically to which aspects of the training data (generated through ab initio simulations) and of the pipeline affect the predictions of quantities of interest, and by how much. Here, taking the prediction of the phonon dispersion of diamond as a case study, we present examples of how different properties of the training dataset and the training pipeline affect phonon dispersion predictions. Specifically, we point out the roles of the plane-wave cutoff, the simulation cell size, the training dataset size, and the random seed used for model parameter initialization. Potential strategies to mitigate these sources of variation are also discussed.

March 19, 2026, 9:24 AM – March 19, 2026, 9:36 AM

Presenters

Jaesuk Park
- The University of Texas at Austin

Authors

Jaesuk Park
- The University of Texas at Austin
Feliciano Giustino
- The University of Texas at Austin