Probing the Critical Point (CritPt) of AI Reasoning: Benchmarking LLMs at the Frontier of Physics
ORAL
Abstract
Are LLMs capable of the original, research-level reasoning required to advance modern physics? Which models and configurations should physicists choose among the exploding number of AI tools?
We present CritPt (Complex Research using Integrated Thinking - Physics Test), the first benchmark of unpublished, realistic research reasoning tasks spanning condensed matter, quantum, AMO, astrophysics, high energy, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 75 composite challenges simulating full-scale junior-PhD research projects, decomposed into 200 simpler checkpoint tasks for fine-grained behavioral analysis. All problems are newly created by 50+ physicists from their own research, ensuring they are unseen by LLMs and have guess-resistant, machine-verifiable answers.
Using a physics-informed automated evaluation pipeline, we find current models make progress on well-scoped small tasks but remain far from reliably solving full-scale challenges: the strongest base model, GPT-5 (high), achieves only 4.0% average accuracy, rising to ~10% with coding tools. The pipeline also tracks resource usage, revealing inefficiencies and high costs of commercial models. Our interactive visualization tool allows streamlined analysis of large-scale model outputs and uncovers novel model behavior. The pipeline is hosted online for future tests, guiding the development of scientifically grounded AI tools.
*This work was partially supported by the US Department of Energy at Argonne National Lab through LDRD funding (DE-AC02-06CH11357) and the Argonne Leadership Computing Facility (Contract DE-AC02-06CH11357), and DeltaAI (award OAC 2320345) from the University of Illinois Urbana-Champaign and its National Center for Supercomputing Applications. EH was partially supported by NSF grants OAC-2209892 and OAC-2514142.
Publication: Zhu, M., Tian, M., Yang, X., Zhou, T., Zhu, P., Chertkov, E., ... & Peng, H. (2025). Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark. arXiv preprint arXiv:2509.26574.
Presenters
- Minhui Zhu, Argonne National Laboratory