TL;DR#
AI models have shown promise in math, yet physics remains a hurdle. Existing physics datasets often consist of multiple-choice questions or focus on primary- to high-school-level problems on which frontier models already perform well. To fill the gap, the study introduces PHYSICS, a benchmark of 1,297 expert-annotated problems spanning six core areas that require advanced physics knowledge and mathematical reasoning.
PHYSICS evaluates models with open-ended questions and reveals clear limitations: the best model achieves only 59.9% accuracy. Key failure modes include incorrect assumptions, misunderstanding of the given data, calculation errors, and misinterpretation of the question. The study also explores diverse prompting strategies and Retrieval-Augmented Generation (RAG) to improve performance, identifying directions for future advancement.
Key Takeaways#
Why does it matter?#
This PHYSICS benchmark provides a critical tool for evaluating and improving AI models in advanced physics problem-solving. It is important for researchers because it highlights current limitations and guides future development toward more robust, scientifically grounded AI.
Visual Insights#
🔼 Figure 1 showcases a sample problem from the PHYSICS benchmark dataset, specifically focusing on classical mechanics. The problem presents a diagram of a siphon and asks for calculations related to water flow speed and maximum height, requiring application of Bernoulli’s equation. The caption highlights PHYSICS as a comprehensive benchmark comprising 1,297 expert-annotated university-level physics problems.
Figure 1: An example of a classical mechanics problem in PHYSICS. PHYSICS is a comprehensive benchmark for university-level physics problem solving which contains 1,297 expert-annotated problems.
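For orientation, the textbook siphon analysis behind this kind of problem (the benchmark's specific numbers are not reproduced here, so the symbols below are illustrative) applies Bernoulli's equation between the reservoir surface and the outlet, and bounds the crest height by atmospheric pressure:

```latex
% Bernoulli between the free surface (at rest, pressure P_atm) and the outlet
% (also at P_atm, a depth d below the surface):
%   P_atm + \rho g d = P_atm + \tfrac{1}{2}\rho v^2  \;\Rightarrow\;  v = \sqrt{2 g d}.
% The siphon crest cannot sit much higher than P_atm/(\rho g) above the surface,
% or the pressure there would fall below the vapor pressure of water.
\[
  v_{\text{outlet}} = \sqrt{2 g d},
  \qquad
  h_{\max} \lesssim \frac{P_{\text{atm}}}{\rho g} \approx 10\ \text{m (for water)}.
\]
```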
| Benchmark | Multi-modal | Size | Level | Question Type | Evaluation | Reasoning Steps |
|---|---|---|---|---|---|---|
| JEEBench (Arora et al., 2023) | ✗ | 515 | CEE | OE, MC | Rule-Based | - |
| MATH (Hendrycks et al., 2021) | ✗ | 12,500 | K12-Comp | OE | Rule-Based | - |
| HARDMath (Fan et al., 2024) | ✗ | 1,466 | Graduate | OE | Rule + Model | - |
| GSM8K (Cobbe et al., 2021) | ✗ | 8,500 | K8 | OE | Rule-Based | 5.0 |
| GPQA (Rein et al., 2024) | ✗ | 227 | Graduate | OE | Rule-Based | 3.6 |
| SciQ (Welbl et al., 2017) | ✗ | 13,679 | K4-K8 | MC, OE | Rule-Based | - |
| SciEval (Sun et al., 2023) | ✗ | 1,657 | - | OE, MC | Rule-Based | - |
| SciBench (Wang et al., 2024) | ✓ | 295 | University | OE | Rule-Based | 2.8 |
| MMMU (Yue et al., 2024a) | ✓ | 443 | University | OE, MC | Rule-Based | - |
| MMMU-Pro (Yue et al., 2024b) | ✓ | 3,460 | University | MC | Rule-Based | - |
| MMVU (Zhao et al., 2025) | ✓ | 3,000 | University | OE, MC | Rule + Model | - |
| ScienceQA (Lu et al., 2022) | ✓ | 617 | K1-K12 | MC | Rule-Based | 2.4 |
| OlympiadBench (He et al., 2024) | ✓ | 2,334 | Comp | OE | Rule-Based | 3.7 |
| PutnamBench (Tsoukalas et al., 2024) | ✗ | 1,692 | University | OE | Rule-Based | - |
| **PHYSICS** | ✓ | 1,297 | University | OE | Rule + Model | 5.7 |
🔼 This table compares the PHYSICS benchmark with other existing benchmarks across several key features. These features include the modality (whether the benchmark uses text only or text and images), the size of the benchmark (number of questions), the educational level of the questions (ranging from elementary school to graduate level), the type of questions (multiple choice or open-ended), the evaluation method used, and the average number of reasoning steps required to solve the problems. The table provides a context for understanding the relative difficulty and scope of the PHYSICS benchmark in comparison to others.
Table 1: Comparison of PHYSICS with other benchmarks. For Level, Comp: Competition, CEE: University Entrance Examination, K1-K12: Elementary and High School Level. For Question Type, OE: Open-ended Questions, MC: Multiple-choice Questions. Reasoning Steps are based on the statistics provided in the corresponding reference.
In-depth insights#
Physics FM Limits#
Foundation models (FMs) face considerable limits when applied to physics due to the domain’s reliance on precise mathematical formulations and complex reasoning. Unlike tasks where FMs can lean on pattern recognition, physics demands a deep understanding of underlying principles and accurate equation manipulation. Current FMs often struggle with multi-step problem-solving that requires integrating diverse concepts. Furthermore, FMs may lack the physical intuition needed to apply formulas correctly and interpret results in real-world contexts. Addressing these limits requires improving FMs’ ability to handle symbolic mathematics, integrate domain-specific knowledge, and perform robust reasoning.
PHYSICS Dataset#
The PHYSICS dataset appears to be a specifically curated collection for evaluating foundation models on university-level physics problems. Its key strength likely lies in its composition of problems demanding advanced knowledge and mathematical reasoning, potentially sourced from PhD qualifying exams to ensure high difficulty. It is designed to assess models’ abilities in core physics areas such as classical mechanics, quantum mechanics, thermodynamics, electromagnetism, atomic physics, and optics. By incorporating expert-annotated problems of high complexity, the dataset avoids the limitations of multiple-choice formats, enabling a more thorough and accurate evaluation of models’ physics problem-solving skills in open-ended scenarios and complex reasoning tasks.
Automated Eval#
The automated evaluation component is a critical aspect of modern benchmarking: it allows for objective and consistent assessment of model performance. Key considerations include accurately extracting solutions, especially from formats such as LaTeX, and handling varying symbolic representations. A robust framework standardizes mathematical expressions, verifies correctness through rule-based equivalence checking (e.g., using SymPy), and can still assess answers whose surface form does not directly match the reference. It should also address challenges inherent to AI, such as logical reasoning, conceptual physics problems, and mathematical precision. GPT-4-assisted evaluation can augment rule-based assessments, which matters for judging nuanced solutions: the system should not dismiss correct but unconventionally derived answers, and this enhances reliability and fairness. High-quality automated evaluation is also essential for reducing costs and making the research more reproducible.
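A minimal sketch of such a rule-based equivalence check, assuming answers arrive as SymPy-parseable strings (the benchmark's actual pipeline also normalizes LaTeX and falls back to model-assisted judging; the function name here is illustrative):

```python
import sympy as sp

def symbolically_equivalent(expr_a: str, expr_b: str) -> bool:
    """Return True if two answer strings are mathematically equivalent.

    Parse both expressions, subtract them, and check whether the
    difference simplifies to zero.
    """
    try:
        a, b = sp.sympify(expr_a), sp.sympify(expr_b)
        return sp.simplify(a - b) == 0
    except (sp.SympifyError, TypeError):
        # Parsing failed; in a full pipeline this is where a model-based
        # judge (e.g., GPT-4) would take over.
        return False

# "sqrt(2*g*d)" and "(2*g*d)**(1/2)" denote the same outflow speed.
print(symbolically_equivalent("sqrt(2*g*d)", "(2*g*d)**(1/2)"))  # True
```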
RAG for Physics#
Retrieval-Augmented Generation (RAG) for Physics is a promising avenue for enhancing the capabilities of foundation models in this domain. Since physics problems often demand integrating diverse knowledge pieces, RAG enables models to retrieve relevant information from external sources like textbooks or scientific literature. This augmentation mitigates the limitations of models’ internal knowledge, particularly when dealing with specialized concepts or complex derivations. By grounding the reasoning process in retrieved evidence, RAG can improve the accuracy and reliability of solutions. However, challenges include formulating effective search queries, selecting relevant information from retrieved content, and seamlessly integrating retrieved knowledge into the reasoning process. This requires models to discern essential concepts and relationships within complex physics problems and formulate queries that retrieve precise and contextually relevant information. Effectively managing the retrieved information and combining it with existing knowledge to produce accurate physics problem solutions is the key to success for RAG in physics.
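To make the control flow concrete, here is a dependency-free sketch of the retrieve-then-prompt pattern; the keyword-overlap retriever, corpus snippets, and function names are stand-ins for the dense retrieval over physics textbooks that a real setup (including the paper's RAG experiments) would use:

```python
def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank passages by word overlap with the query.

    A real pipeline would embed a physics corpus and the query with a
    dense retriever; this scorer only illustrates the RAG control flow.
    """
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return ranked[:k]

def build_rag_prompt(problem: str, corpus: list[str]) -> str:
    """Prepend retrieved reference material to the problem before prompting a model."""
    context = "\n".join(f"- {p}" for p in retrieve(problem, corpus))
    return (
        "Reference material (use it if helpful):\n"
        f"{context}\n\n"
        f"Problem: {problem}\nSolve step by step and state the final answer."
    )

corpus = [
    "bernoulli's equation relates pressure, flow speed, and height along a streamline.",
    "the partition function determines thermodynamic quantities in statistical physics.",
]
print(build_rag_prompt("Apply Bernoulli's equation to find the flow speed at the siphon outlet.", corpus))
```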
Multi-Modal Data#
Multi-modal data is crucial for a comprehensive understanding of complex phenomena. Integrating diverse data types, such as text, images, and audio, provides a richer context and enables more nuanced analysis. For example, in medical diagnosis, combining patient history (text), X-ray images, and heart sounds (audio) can improve accuracy. Challenges include data synchronization, feature extraction across modalities, and dealing with heterogeneous data formats. Deep learning models, particularly those with attention mechanisms, are effective in learning cross-modal representations. Future research should focus on developing robust and interpretable methods for multi-modal fusion. This will drive advances in various fields, from AI to scientific discovery.
More visual insights#
More on figures
🔼 The figure illustrates the workflow of the PHYSICS benchmark dataset creation and model evaluation. First, annotators contribute problems (§3.2). These problems then undergo validation to create a refined dataset. Next, this dataset is used to prompt various language models (§4.1). The models’ responses are processed using regular expressions and the SymPy library for symbolic mathematics (§4.2). Finally, an automated system performs the final evaluation.
Figure 2: For the overall process, we begin by collecting annotated problems from annotators (§3.2), followed by validation to create a processed dataset. This dataset is then used to prompt models (§4.1). The responses from models undergo regular expression pre-processing and SymPy-based processing before final evaluation using an automated system (§4.2).
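As a rough illustration of the pre-processing step, the snippet below assumes the model was asked to wrap its result in `\boxed{...}` (a common convention; the paper's exact prompt and regular expressions may differ) and hands the extracted string to SymPy for downstream comparison:

```python
import re
import sympy as sp

def extract_final_answer(response: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a model response."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

response = r"Applying Bernoulli's equation along the streamline gives \boxed{sqrt(2*g*d)}."
answer = extract_final_answer(response)
print(sp.sympify(answer))  # parsed SymPy expression, ready for equivalence checking
```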
🔼 This histogram displays the frequency distribution of reasoning steps needed to solve the physics problems in the PHYSICS benchmark dataset. The x-axis represents the number of reasoning steps, and the y-axis shows the percentage of problems requiring that many steps. The distribution helps illustrate the complexity of the problems, indicating that a substantial portion require multiple steps to solve, highlighting the challenge for AI models.
Figure 3: Reasoning steps distribution.
More on tables
| Category | Value |
|---|---|
| **Dataset Overview** | |
| Total Questions | 1,297 |
| Questions with Figures | 298 |
| Validation : Test Split | 297 : 1,000 |
| Hard : Regular Questions | 523 : 774 |
| **Subject Distribution** | |
| Number of Subjects | 6 |
| Atomic Physics | 200 |
| Electromagnetism | 242 |
| Classical Mechanics | 221 |
| Optics | 158 |
| Quantum Mechanics | 236 |
| Statistical Physics | 240 |
| **Question Complexity** | |
| Avg. Question Length (words) | 83.7 |
| **Solution Statistics** | |
| Avg. Solution Length (words) | 234.75 |
| Avg. Reasoning Steps | 5.38 |
🔼 Table 2 presents a detailed statistical overview of the PHYSICS dataset, a benchmark for evaluating AI models’ ability to solve university-level physics problems. It shows the total number of questions, the number of questions with figures, the split of the dataset into validation and test sets, the breakdown of questions by physics subject area, the average length of questions and solutions (in words), and the average number of reasoning steps required to solve the problems. It also provides a breakdown of questions categorized as ‘hard’ based on annotator assessment. This comprehensive statistical summary allows researchers to understand the characteristics of the PHYSICS dataset and to assess its suitability for their research needs.
Table 2: Dataset statistics of PHYSICS.
*Subject columns report test-set performance; the Val and Test columns report overall accuracy on the validation and test splits.*

| Model | AMO | E&M | CM | Opt. | QM | Stats. | Val | Test |
|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | |
| o3-mini | 52.4 | 64.9 | 59.8 | 51.5 | 66.0 | 60.0 | 55.0 | 59.9 |
| o1-mini | 45.4 | 41.8 | 41.9 | 40.6 | 44.3 | 48.0 | 44.1 | 43.6 |
| Gemini-1.5-pro† | 35.5 | 40.2 | 31.5 | 32.2 | 44.5 | 43.7 | 35.3 | 38.4 |
| GPT-4o† | 35.3 | 44.1 | 33.4 | 23.4 | 33.8 | 45.0 | 34.7 | 36.7 |
| Claude-3.5-Sonnet† | 37.2 | 34.8 | 27.6 | 35.5 | 35.1 | 38.4 | 31.7 | 34.7 |
| **Open-Source Models** | | | | | | | | |
| DeepSeek-R1 | 37.0 | 48.6 | 38.3 | 43.1 | 44.5 | 51.5 | 44.2 | 44.3 |
| Qwen2.5-Math-72B | 27.0 | 34.8 | 27.3 | 27.4 | 36.2 | 37.0 | 38.5 | 32.2 |
| Llama-3.3-70B | 28.2 | 35.8 | 27.9 | 17.2 | 31.4 | 41.3 | 34.3 | 31.5 |
| phi-4 | 32.8 | 33.0 | 19.8 | 27.2 | 23.4 | 35.2 | 28.7 | 29.1 |
| Qwen2.5-72B | 28.8 | 30.9 | 23.0 | 25.4 | 27.4 | 33.2 | 31.5 | 28.7 |
| Qwen2.5-32B | 25.5 | 27.5 | 19.4 | 20.8 | 24.7 | 41.1 | 23.3 | 27.6 |
| Mistral-Small-24B | 19.1 | 29.5 | 19.6 | 17.6 | 15.2 | 28.4 | 25.1 | 21.8 |
| Qwen2.5-7B | 21.8 | 28.1 | 11.2 | 18.7 | 17.4 | 22.1 | 20.9 | 20.4 |
| Qwen2.5-14B | 23.8 | 19.7 | 14.1 | 12.3 | 13.5 | 28.2 | 25.3 | 19.6 |
| Gemma-2-27b | 14.3 | 19.0 | 16.2 | 13.4 | 18.4 | 25.9 | 21.7 | 18.3 |
| Yi-1.5-34B | 11.0 | 15.4 | 18.0 | 13.2 | 19.6 | 25.2 | 25.3 | 17.4 |
| Qwen2.5-Math-1.5B | 13.3 | 14.8 | 16.5 | 16.2 | 17.2 | 19.5 | 15.1 | 16.4 |
| InternVL2-5-38B† | 15.3 | 12.5 | 12.5 | 7.7 | 18.0 | 23.1 | 16.7 | 15.3 |
| Aria† | 13.0 | 14.0 | 14.2 | 11.7 | 9.7 | 14.4 | 12.7 | 12.9 |
| QwQ-32B-Preview | 16.7 | 7.5 | 10.1 | 11.2 | 10.6 | 14.8 | 12.4 | 12.1 |
| Gemma-2-9b | 9.4 | 8.2 | 9.1 | 16.5 | 12.1 | 16.9 | 15.2 | 11.9 |
| Mistral-7B | 10.1 | 10.4 | 5.1 | 13.7 | 11.6 | 17.6 | 12.6 | 11.7 |
| Llama-3.1-8B | 8.4 | 17.4 | 6.8 | 14.7 | 7.4 | 16.1 | 9.1 | 11.7 |
| Mathstral-7B | 7.3 | 10.0 | 12.0 | 9.6 | 8.2 | 17.6 | 12.0 | 10.8 |
| c4ai-command-r-v01 | 2.0 | 7.8 | 7.5 | 3.8 | 7.5 | 11.4 | 6.8 | 7.0 |
| DeepSeek-R1-Distill-Qwen-32B | 9.1 | 5.4 | 4.8 | 9.8 | 2.3 | 10.2 | 7.1 | 6.8 |
| Gemma-2-2b | 6.6 | 6.2 | 3.9 | 10.3 | 3.9 | 7.3 | 6.1 | 6.1 |
| Qwen2-VL-72B† | 11.8 | 3.5 | 4.6 | 4.0 | 2.9 | 4.2 | 4.5 | 5.0 |
| Internlm3-8b | 1.8 | 4.6 | 4.7 | 3.2 | 4.0 | 9.2 | 4.1 | 4.8 |
| DeepSeek-vl2-small† | 3.1 | 1.8 | 1.8 | 4.5 | 0.0 | 0.3 | 4.8 | 1.7 |
| THUDM-chatglm3-6b | 0.9 | 2.3 | 0.0 | 0.7 | 0.9 | 2.0 | 0.9 | 1.2 |
| Qwen2.5-Math-7B | 1.4 | 1.7 | 0.0 | 2.1 | 0.0 | 1.5 | 1.9 | 1.0 |
| DeepSeek-math-7b-rl | 0.7 | 0.0 | 0.0 | 1.5 | 0.0 | 0.6 | 0.9 | 0.4 |
🔼 Table 3 presents a detailed performance comparison of various large language models (LLMs) across six core physics subfields. It shows the accuracy of each model on problems from Atomic Physics (AMO), Electromagnetism (E&M), Classical Mechanics (CM), Optics (Opt), Quantum Mechanics (QM), and Thermodynamics & Statistical Physics (Stats). The table distinguishes between proprietary and open-source models and indicates which models can handle multimodal problems (those including images). Models are ranked by their average performance on a held-out test set. This provides a comprehensive evaluation of the strengths and weaknesses of different LLMs in solving physics problems, highlighting the challenges that remain for even state-of-the-art models.
Table 3: Performance comparison across tasks. †: These models are equipped with multi-modal abilities; problems with images are also tested on these models. Abbreviations: AMO (Atomic Physics) | E&M (Electromagnetism) | CM (Classical Mechanics) | Opt. (Optics) | QM (Quantum Mechanics) | Stats. (Thermodynamics and Statistical Physics). The models are ranked by average test set performance.
| ID | Year | Major | Assigned Subject(s) |
|---|---|---|---|
| 1 | 2nd Year Undergraduate | Physics | Classical Mechanics, Electromagnetism |
| 2 | 3rd Year Undergraduate | Physics | Quantum Mechanics, Optics |
| 3 | 2nd Year Undergraduate | Theoretical Physics | Quantum Mechanics, Thermodynamics and Statistical Mechanics |
| 4 | 3rd Year Undergraduate | Applied Physics | Thermodynamics and Statistical Mechanics, Atomic Physics |
| 5 | 2nd Year Undergraduate | Engineering Physics | Electromagnetism, Classical Mechanics |
| 6 | 3rd Year Undergraduate | Physics | Thermodynamics and Statistical Mechanics, Atomic Physics |
| 7 | 2nd Year Undergraduate | Astrophysics | Classical Mechanics, Optics |
🔼 Table 4 provides detailed biographical information for the seven expert annotators who contributed to the creation of the PHYSICS benchmark dataset. For each annotator, the table lists their ID number, academic year (as an undergraduate student), their major, and the specific physics subfields they were responsible for annotating. This information is crucial for understanding the qualifications and expertise of the individuals who created the dataset, thereby ensuring the reliability and accuracy of the benchmark’s annotation.
Table 4: Biographies of the 7 annotators involved in the PHYSICS benchmark construction.