Probing the Critical Point (CritPt) of AI Reasoning: Benchmarking LLMs at the Frontier of Physics

ORAL

Abstract

Are LLMs capable of the original, research-level reasoning required to advance modern physics? Which models and configurations should physicists choose among the exploding number of AI tools?

We present CritPt (Complex Research using Integrated Thinking - Physics Test), the first benchmark of unpublished, realistic research reasoning tasks spanning condensed matter, quantum, AMO, astrophysics, high energy, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics. CritPt consists of 75 composite challenges simulating full-scale junior-PhD research projects, decomposed into 200 simpler checkpoint tasks for fine-grained behavioral analysis. All problems were newly created by 50+ physicists from their own research, ensuring that they are unseen by LLMs and have guess-resistant, machine-verifiable answers.

Using a physics-informed automated evaluation pipeline, we find that current models make progress on well-scoped small tasks but remain far from reliably solving full-scale challenges: the strongest base model, GPT-5 (high), achieves only 4.0% average accuracy, rising to ~10% with coding tools. The pipeline also tracks resource usage, revealing inefficiencies and high costs of commercial models. Our interactive visualization tool allows streamlined analysis of large-scale model outputs and uncovers novel model behavior. The pipeline is hosted online for future tests, guiding the development of scientifically grounded AI tools.

*This work was partially supported by the US Department of Energy at Argonne National Lab through LDRD funding (DE-AC02-06CH11357) and the Argonne Leadership Computing Facility (Contract DE-AC02-06CH11357), and DeltaAI (award OAC 2320345) from the University of Illinois Urbana-Champaign and its National Center for Supercomputing Applications. EH was partially supported by NSF grants OAC-2209892 and OAC-2514142.

Publication: Zhu, M., Tian, M., Yang, X., Zhou, T., Zhu, P., Chertkov, E., ... & Peng, H. (2025). Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark. arXiv preprint arXiv:2509.26574.

Presenters

  • Minhui Zhu

    • Argonne National Laboratory

Authors

  • Minhui Zhu

    • Argonne National Laboratory
  • Minyang Tian

    • Argonne National Laboratory, University of Illinois Urbana-Champaign
  • Xiaocheng Yang

    • University of Illinois Urbana-Champaign
  • Tianci Zhou

    • Massachusetts Institute of Technology
    • Virginia Polytechnic Institute and State University (Virginia Tech)
  • Penghao Zhu

    • Ohio State University
  • Eli Chertkov

    • Independent
    • Quantinuum
  • Shengyan Liu

    • University of Illinois at Urbana-Champaign
  • Yufeng Du

    • University of Illinois Urbana-Champaign
  • Lifan Yuan

    • University of Illinois Urbana-Champaign
  • Ziming Ji

    • Northeastern University
  • Indranil Das

    • University of Illinois at Urbana-Champaign
  • Junyi Cao

    • University of Illinois at Urbana-Champaign
  • Yufeng Du

    • California Institute of Technology
  • Jinchen He

    • University of Maryland College Park
  • Yifan Su

    • Massachusetts Institute of Technology
    • Columbia University
  • Peixue Wu

    • University of Waterloo
  • Jiabin Yu

    • University of Florida
  • Yikun Jiang

    • Northeastern University
  • Yujie Zhang

    • Perimeter Institute for Theoretical Physics, University of Waterloo
  • Chang Liu

    • University of Connecticut
  • Daniel A Inafuku

    • University of Illinois at Urbana-Champaign
  • Nicholas Chia

    • Argonne National Laboratory
  • Eliu Huerta

    • Argonne National Laboratory
  • Hao Peng

    • University of Illinois Urbana-Champaign