Improving Scarce Data Workflows and Generating Hypotheses for Metastable Chemical Vapor Deposition Growth Using Large Language Models
ORAL · Invited
Abstract
I present two studies showing how large language models (LLMs) can accelerate materials discovery. In the first part, I investigate LLM-based data imputation and feature engineering to address data scarcity in two small, sparsely populated datasets: a graphene CVD dataset compiled from the literature and the DOE HydPARK metal hydrides database. We compare LLM-generated imputations with traditional statistical methods such as k-nearest neighbors and multivariate imputation by chained equations. By systematically varying prompting strategies, from generic prompts that emphasize autonomous LLM inference to data-informed prompts that tightly constrain the output, we show how the degree of model autonomy affects imputation accuracy and the distribution of imputed values, and we demonstrate when LLM imputation outperforms statistical methods. We further show that LLMs can standardize inconsistent categorical features, such as substrate labels. Integrating these LLM-based data engineering methods with downstream machine-learning tasks substantially boosts performance in data-scarce settings; for example, graphene layer-number classification improves from 39% to 65% (binary) and from 52% to 72% (ternary). Together, these results highlight how LLMs offer an effective, computation-free approach to mitigating data sparsity while harmonizing heterogeneous materials datasets.
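To make the statistical baselines concrete, the following is a minimal sketch of k-nearest-neighbors and chained-equations imputation on a synthetic table. It is not the pipeline used in the study: the data are randomly generated, the helper `rmse_on_missing` and the 20% missingness rate are illustrative assumptions, and scikit-learn's `IterativeImputer` is a MICE-inspired imputer rather than a full multiple-imputation procedure.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.default_rng(0)

# Synthetic stand-in for a small numeric dataset (assumption, not real CVD data):
# three correlated columns so that imputers have structure to exploit.
n = 60
x0 = rng.normal(size=n)
X_true = np.column_stack([
    x0,
    2.0 * x0 + rng.normal(scale=0.1, size=n),
    -x0 + rng.normal(scale=0.1, size=n),
])

# Hide roughly 20% of the entries in columns 1 and 2; column 0 stays fully observed.
mask = rng.random(X_true.shape) < 0.2
mask[:, 0] = False
X_missing = X_true.copy()
X_missing[mask] = np.nan


def rmse_on_missing(X_imputed):
    """Root-mean-square error evaluated only on the entries that were hidden."""
    diff = X_imputed[mask] - X_true[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Baseline 1: k-nearest-neighbors imputation.
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)

# Baseline 2: chained-equations style imputation (MICE-inspired).
X_mice = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_missing)

scores = {"knn": rmse_on_missing(X_knn), "mice_like": rmse_on_missing(X_mice)}
print(scores)
```

An LLM-generated imputation could be scored with the same hidden-entry error, which keeps the comparison between methods on equal footing.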
In the second part, I outline ongoing work in which we use decoder-only LLMs combined with retrieval-augmented generation and structured prompting to construct an AI-based expert in CVD diamond at ambient conditions, a synthesis route for a metastable phase of carbon. By grounding the LLM in curated mechanistic literature and experimental reports, we aim to create a controllable reasoning engine capable of proposing physically plausible growth mechanisms, processing windows, and design hypotheses. We then extend this framework to explore whether analogous metastable-phase synthesis pathways may exist for other chemistries, with a particular focus on high-pressure phases of boron-rich materials. This approach enables the generation of LLM-guided hypotheses for realizing high-pressure phases using CVD-like methods.
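As a rough illustration of retrieval-augmented generation with structured prompting, the sketch below ranks a few passages by token overlap with a question and assembles a prompt with fixed sections. Everything in it is hypothetical: the corpus entries are placeholders rather than real literature excerpts, the overlap score stands in for whatever retriever the project actually uses, and no LLM is called.

```python
import re
from collections import Counter

# Placeholder corpus (assumption): stand-ins for curated literature passages.
CORPUS = {
    "doc_a": "placeholder note about nucleation and substrate temperature in diamond growth",
    "doc_b": "placeholder note about precursor gas composition and growth rate",
    "doc_c": "placeholder note about boron rich phases and pressure",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, corpus, k=2):
    """Return up to k document ids ranked by token overlap with the query."""
    q = Counter(tokenize(query))
    scored = []
    for doc_id, text in corpus.items():
        d = Counter(tokenize(text))
        overlap = sum(min(q[t], d[t]) for t in q)
        scored.append((overlap, doc_id))
    # Highest overlap first; ties broken alphabetically by id.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for overlap, doc_id in scored[:k] if overlap > 0]


def build_prompt(question, corpus, k=2):
    """Assemble a structured prompt: role, retrieved context, task, output format."""
    ids = retrieve(question, corpus, k)
    context = "\n".join(f"[{i}] {corpus[i]}" for i in ids)
    return (
        "ROLE: You are an assistant reasoning about CVD growth.\n"
        f"CONTEXT:\n{context}\n"
        f"TASK: {question}\n"
        "FORMAT: List each hypothesis with the context ids that support it."
    )


prompt = build_prompt("substrate temperature for diamond nucleation", CORPUS)
print(prompt)
```

Requiring the model to cite context ids in its answer is one simple way to keep generated hypotheses traceable to the grounding passages.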
*This research was supported in part by the National Science Foundation (NSF) under Award Number DMR-2119308 and by the DOE-NNSA cooperative agreement DE-NA-0003975 (Chicago/DOE Alliance Center, CDAC).
Publication: https://arxiv.org/abs/2503.04870
Presenters
- Sara Kadkhodaei, U Illinois Chicago