Identifying authenticity in student writing using text embeddings

Carlos Kaskoun; Mike Verostek; Natasha Holmes

Oral-In-person

Abstract

The use of machine learning and large language models (LLMs) is a new and exciting area within Physics Education Research. In our introductory physics lab courses, hundreds of students write memos as a final assignment, sharing information about their experiences and projects. To analyze authenticity, we code these memos, characterizing them by the themes present, using a previously developed codebook. This coding is currently done by hand, which requires a large amount of time and energy. To enable faster and larger-scale analyses, we have been building on text embedding methods developed by Odden et al. Text embeddings take text as input and, using a pre-trained LLM, return a high-dimensional 'meaning' vector that represents the themes present in the text. I will discuss the novel methods we are exploring that use text embeddings to automate the sorting of longer student reflections into specific codes, improve speed, and assess accuracy against hand-coded data. Adopting these methods would allow faster, more efficient, and more consistent thematic coding for a broader set of researchers.
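
As a rough illustration of the general idea described in the abstract (and not the authors' actual pipeline), the sketch below assigns each memo to the code whose description it most resembles by cosine similarity, then measures agreement with hand-coded labels. Everything named here is an assumption made for illustration: `toy_embed` is a simple word-count stand-in for a pre-trained LLM embedding, and the codes "agency" and "teamwork", the vocabulary, and the example memos are hypothetical and are not taken from the codebook or the data.

```python
import math
from collections import Counter

# Hypothetical vocabulary, codes, and memos, invented for this sketch only.
VOCAB = ["we", "chose", "design", "own", "experiment",
         "team", "partner", "together", "worked"]
CODES = {
    "agency": "chose own design experiment",
    "teamwork": "team partner worked together",
}

def toy_embed(text, vocab):
    """Stand-in for a pre-trained embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_code(text, code_descriptions, vocab):
    """Return the code whose description vector is most similar to the text vector."""
    vec = toy_embed(text, vocab)
    return max(
        code_descriptions,
        key=lambda code: cosine(vec, toy_embed(code_descriptions[code], vocab)),
    )

def agreement(predicted, hand_coded):
    """Fraction of memos where the automated code matches the hand-assigned code."""
    matches = sum(p == h for p, h in zip(predicted, hand_coded))
    return matches / len(hand_coded)

memos = [
    "we chose our own experiment design",
    "my partner and i worked together as a team",
]
hand_coded = ["agency", "teamwork"]  # hypothetical hand-assigned labels

predicted = [assign_code(memo, CODES, VOCAB) for memo in memos]
print(predicted, agreement(predicted, hand_coded))
```

In practice, `toy_embed` would be replaced by a call to a pre-trained embedding model, and simple agreement could be swapped for a chance-corrected statistic; both choices are left open here.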

March 18, 2026, 5:06 PM – March 18, 2026, 5:18 PM

Presenters

Carlos Kaskoun
- Cornell University

Authors

Carlos Kaskoun
- Cornell University
Mike Verostek
- Cornell University
Natasha Holmes
- Cornell University