Probing the Critical Point (CritPt) of AI Reasoning: Benchmarking LLMs at the Frontier of Physics

Minhui Zhu; Minyang Tian; Xiaocheng Yang; Tianci Zhou; Penghao Zhu; Eli Chertkov; Shengyan Liu; Yufeng Du; Lifan Yuan; Ziming Ji; Indranil Das; Junyi Cao; Yufeng Du; Jinchen He; Yifan Su; Peixue Wu; Jiabin Yu; Yikun Jiang; Yujie Zhang; Chang Liu; Daniel Inafuku; Nicholas Chia; Eliu Huerta; Hao Peng

Oral-In-person

Abstract

Are LLMs capable of the original, research-level reasoning required to advance modern physics? Which models and configurations should physicists choose among the exploding number of AI tools?

We present CritPt (Complex Research using Integrated Thinking - Physics Test), the first benchmark of unpublished, realistic research reasoning tasks spanning condensed matter, quantum physics, AMO physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics. CritPt consists of 75 composite challenges that simulate full-scale, junior-PhD-level research projects, decomposed into 200 simpler checkpoint tasks for fine-grained behavioral analysis. All problems were newly created by 50+ physicists from their own research, ensuring that they are unseen by LLMs and have guess-resistant, machine-verifiable answers.

Using a physics-informed automated evaluation pipeline, we find that current models make progress on well-scoped small tasks but remain far from reliably solving full-scale challenges: the strongest base model, GPT-5 (high), achieves only 4.0% average accuracy, rising to ~10% with coding tools. The pipeline also tracks resource usage, revealing the inefficiencies and high costs of commercial models. Our interactive visualization tool allows streamlined analysis of large-scale model outputs and uncovers novel model behavior. The pipeline is hosted online for future tests, guiding the development of scientifically grounded AI tools.

March 17, 2026, 10:00 AM – 10:12 AM

Publication: Zhu, M., Tian, M., Yang, X., Zhou, T., Zhu, P., Chertkov, E., ... & Peng, H. (2025). Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark. arXiv preprint arXiv:2509.26574.

Presenters

Minhui Zhu
- Argonne National Laboratory

Authors

Minhui Zhu
- Argonne National Laboratory
Minyang Tian
Xiaocheng Yang
Tianci Zhou
- Massachusetts Institute of Technology
Penghao Zhu
Eli Chertkov
Shengyan Liu
- University of Illinois at Urbana-Champaign
Yufeng Du
Lifan Yuan
Ziming Ji
Indranil Das
- University of Illinois at Urbana-Champaign
Junyi Cao
- University of Illinois at Urbana-Champaign
Yufeng Du
Jinchen He
- University of Maryland College Park
Yifan Su
- Massachusetts Institute of Technology
Peixue Wu
Jiabin Yu
- University of Florida
Yikun Jiang
Yujie Zhang
Chang Liu
- University of Connecticut
Daniel Inafuku
- University of Illinois at Urbana-Champaign
Nicholas Chia
Eliu Huerta
- Argonne National Laboratory
Hao Peng