Evaluating Capabilities of Large Language Models on Undergraduate Physics Concepts
ORAL
Abstract
This study investigates the effectiveness of transformer-based large language models (LLMs) in solving and explaining undergraduate-level physics problems. Because transformers capture contextual relationships in text, they are well suited to language-based tasks, making them valuable tools for educational applications. This project evaluated a range of models using questions drawn from established physics concept inventories and undergraduate physics textbooks. Both general-purpose LLMs and domain-specific fine-tuned variants were included, allowing a direct comparison of broad and specialized training approaches. Commercial models were also examined through publicly available interfaces. Responses were assessed for correctness and interpretability. Traditional performance metrics such as accuracy, precision, recall, and F1 score quantified correctness, while interpretability was evaluated by rating the clarity and instructional value of the LLMs' explanations. The study also examined the capacity of LLMs to act as judges, exploring their potential role in automated assessment. By evaluating both accuracy and interpretability, this work offers a more comprehensive understanding of LLM performance in physics education than accuracy-focused benchmarks alone. The analysis explores LLMs' abilities as problem solvers and as instructional supports, and its findings contribute to the broader discussion of artificial intelligence in education.
Presenters
- Emily Owens, University of Mount Union
Authors
- Emily Owens, University of Mount Union
- Julie L Butler, University of Mount Union