Constructing Superconductivity and Magnetism Databases using Large Language Models

ORAL

Abstract

Large language models can effectively process natural language and extract key information from even highly technical text. We use this capability to process several thousand materials science and condensed matter papers in order to obtain material parameters and experimental techniques. This work encompasses the entire large language model stack. First, we describe the pipeline for converting PDFs to plain text and the problems encountered in correcting text produced by optical character recognition. We also discuss extending default tokenizers to cover more technical text, including commonly used abbreviations, characters, and chemical formulas. We then discuss different prompting strategies and carefully examine the errors in material parameter extraction. The effectiveness of the pipeline is tested on a human-labelled superconducting material database, which also provides a convenient source of training data. Finally, we compare several large language models of different sizes, as well as fine-tuned versions of these models, in order to speed up inference.
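As a rough illustration of testing extracted parameters against a human-labelled database, the sketch below scores extracted critical temperatures against reference labels. It is not the authors' evaluation code: the function name `score_extraction`, the 5% relative tolerance, the dictionary layout keyed by chemical formula, and the example values are all assumptions made for illustration.

```python
from math import isclose


def score_extraction(extracted, labelled, rel_tol=0.05):
    """Return (precision, recall) of extracted values against human labels.

    Both arguments map a chemical formula to a critical temperature in kelvin.
    An extracted value counts as correct when the formula appears in the
    labelled set and the value agrees within the relative tolerance
    (the 5% default is an assumed threshold, not one from the abstract).
    """
    correct = sum(
        1
        for formula, value in extracted.items()
        if formula in labelled
        and isclose(value, labelled[formula], rel_tol=rel_tol)
    )
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(labelled) if labelled else 0.0
    return precision, recall


# Illustrative entries only; these are not taken from the authors' database.
labelled = {"MgB2": 39.0, "YBa2Cu3O7": 92.0, "Nb3Sn": 18.3}
# Hypothetical model output: one match, one wrong value, one unlabelled material.
extracted = {"MgB2": 39.0, "YBa2Cu3O7": 77.0, "NbTi": 10.0}

precision, recall = score_extraction(extracted, labelled)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting precision and recall separately distinguishes values the model extracted incorrectly from labelled entries it failed to extract at all.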

*This research was partially supported by the National Science Foundation Materials Research Science and Engineering Center program through the UT Knoxville Center for Advanced Materials and Manufacturing (DMR-2309083) and the AI TENNessee Initiative.

Publication: Planned paper:
Constructing Superconductivity and Magnetism Databases using Large Language Models

Presenters

  • Louis D Primeau

    • University of Tennessee

Authors

  • Louis D Primeau

    • University of Tennessee
  • Yang Zhang

    • University of Tennessee
  • Adrian Del Maestro

    • University of Tennessee
    • University of Tennessee-Knoxville