The role of representation and training set selection for improved machine learning models of matter

ORAL

Abstract

The choices of representation and training set are fundamentally important in machine learning (ML) models of chemical and physical properties of matter. Based on the postulates of quantum mechanics, we have developed a hierarchy of representations which meet uniqueness and target similarity criteria. To systematically control target similarity, we rely on interatomic many-body expansions, as implemented in universal force-fields, including bags of sorted {\underline B}onding, {\underline A}ngular, and higher-order terms (BA). Adding higher-order contributions systematically increases the predictive accuracy of the resulting BAML models. BAML predicts properties of out-of-sample molecules with unprecedented accuracy and speed.\footnote{Huang and von Lilienfeld, {\em J. Chem. Phys.} Comm. {\bf 145}, 161102 (2016)} To select optimal training sets, we have developed a rational approach which results in ML models with very rapid error decay.\footnote{Huang and von Lilienfeld, in preparation (2016)} In combination with BAML-based atomic representations, these ML models reach chemical accuracy for atomization energies ($\sim$1 kcal/mol) after training on reference results for only hundreds of chemical compounds. Our findings suggest a dramatic reduction in the amount of training data needed.
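The idea of "bags of sorted terms" can be illustrated with a minimal sketch, restricted to two-body terms only. It is not the BAML implementation: the function name `bag_of_pair_terms`, the input layout, and the water geometry are hypothetical, and a simple Coulomb-like term Z_i Z_j / r is used as a stand-in for the force-field bonding terms named in the abstract. Angular and higher-order bags are omitted.

```python
import math
from collections import defaultdict
from itertools import combinations


def bag_of_pair_terms(atoms):
    """Group pairwise terms into bags keyed by element pair, each sorted.

    atoms: list of (symbol, nuclear_charge, (x, y, z)) tuples.
    Assumption: Z_i * Z_j / r stands in for a force-field bond term.
    """
    bags = defaultdict(list)
    for (s_i, z_i, r_i), (s_j, z_j, r_j) in combinations(atoms, 2):
        # Canonical key so that (H, O) and (O, H) land in the same bag.
        key = tuple(sorted((s_i, s_j)))
        bags[key].append(z_i * z_j / math.dist(r_i, r_j))
    # Sorting inside each bag removes the dependence on atom ordering.
    return {key: sorted(vals, reverse=True) for key, vals in sorted(bags.items())}


# Illustrative water-like geometry (coordinates in angstrom, made up for the example).
water = [
    ("O", 8, (0.000, 0.000, 0.000)),
    ("H", 1, (0.960, 0.000, 0.000)),
    ("H", 1, (-0.240, 0.930, 0.000)),
]
bags = bag_of_pair_terms(water)
```

Because each bag is sorted, the resulting representation does not change when the atoms are listed in a different order, which is one reason sorted bags are a convenient input for kernel-based ML models.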

Authors

  • Bing Huang

    University of Basel

  • Anatole von Lilienfeld

    Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials (MARVEL), University of Basel