Existing benchmarks for evaluating LLMs’ mathematical reasoning are becoming saturated. The rapid development of LLMs calls for more challenging and rigorous evaluations that identify their limitations and guide future improvements. Current benchmarks also struggle to differentiate model capabilities, and multilingual assessment remains scarce.
This paper introduces OlymMATH, a new Olympiad-level math benchmark designed to rigorously test LLMs’ reasoning. OlymMATH contains 200 manually verified problems of varying difficulty, available in parallel English and Chinese versions. Experiments show that even strong models such as DeepSeek-R1 struggle with OlymMATH, underscoring its difficulty. The benchmark enables bilingual assessment, and the accompanying analysis helps characterize models’ limitations in solving complex math problems.
This paper introduces a new Olympiad-level math benchmark for LLMs, addressing limitations in existing benchmarks. It offers rigorous evaluations, bilingual assessment, and insights into models’ reasoning abilities, opening new research avenues to improve LLMs in mathematical problem-solving.
This figure compares the performance of several large language models (LLMs) on various Olympiad-level mathematics benchmarks, including the English version of the OlymMATH benchmark introduced in this paper. The x-axis lists the models tested, and the y-axis shows their accuracy on each benchmark. The bars are grouped by benchmark: OlymMATH-EN-EASY and OlymMATH-EN-HARD (the new benchmark proposed in the paper), alongside existing Olympiad-level benchmarks such as AIME 2024, AIME 2025, and HMMT 202502. The figure highlights that the OlymMATH-HARD subset is significantly more challenging than existing benchmarks, with even state-of-the-art models achieving considerably lower accuracy on it.
Figure 1: Performance comparisons of mainstream reasoning models between our OlymMATH (English version) and other Olympiad-level mathematical benchmarks. OlymMATH-HARD emerges as the most challenging, with significantly higher difficulty than existing evaluation benchmarks.
This table compares various mathematical reasoning benchmarks used to evaluate large language models (LLMs). It lists the name of each benchmark, the number of problems it contains, the number of mathematical fields covered, the languages supported (English, Chinese, or both), the evaluation method used (rule-based or LLM-based), and the difficulty level of the problems (grade school, competition, university, or Olympiad). This allows a comparison of benchmark scale, scope, and difficulty.
Table 1: Comparison of different math evaluation benchmarks. EN and ZH denote English and Chinese, respectively. “Univ.” denotes university-level problems.
The introduction of OlymMATH addresses a critical gap in evaluating LLMs’ mathematical reasoning. Existing benchmarks are becoming saturated, failing to challenge advanced models effectively. OlymMATH, an Olympiad-level benchmark, aims to rigorously test complex reasoning. Its key features include meticulously curated problems, manual verification, parallel English and Chinese versions for comprehensive bilingual assessment, and two distinct difficulty tiers: AIME-level (easy) and significantly more challenging (hard). The problems span four core mathematical fields, each with verifiable numerical solutions for objective evaluation. Initial results show state-of-the-art models struggle with OlymMATH, particularly the hard subset, indicating its higher difficulty and potential to push the boundaries of mathematical reasoning evaluation.
LLM reasoning gaps refer to the disparities between expected and actual performance on complex tasks. These gaps often manifest in areas requiring multi-step inference, nuanced understanding of context, and the ability to handle ambiguity. Current LLMs, while excelling at pattern recognition, often struggle with causal reasoning, analogical thinking, and counterfactual simulation, and exhibit limitations in systematic generalization and transfer learning. Addressing these gaps requires innovations in model architectures, training methodologies, and evaluation benchmarks to better assess and enhance true reasoning capabilities. The focus should be on developing models that can robustly handle novel situations and exhibit more human-like reasoning proficiency.
A bilingual analysis of mathematical reasoning models offers a potent lens for evaluating their capabilities. By assessing performance across both English and Chinese, we can discern the influence of language on problem-solving efficacy. Differences in linguistic structure, cultural context embedded in problems, and the nuances of mathematical terminology across languages can all contribute to variations in a model’s accuracy and efficiency. Furthermore, a bilingual approach helps mitigate biases inherent in datasets predominantly available in one language, fostering more robust and equitable evaluations. Observing performance gaps highlights the need for enhanced multilingual training, ensuring that models generalize effectively across diverse linguistic landscapes and achieve true cross-cultural reasoning proficiency.
In analyzing the dataset details, several key aspects emerge. First, the paper emphasizes the high quality and originality of the Olympiad-level problems, which are meticulously curated from printed resources to minimize data contamination and manually sourced from specialized magazines, textbooks, and official competition materials. The dataset spans four primary mathematical fields (algebra, geometry, number theory, and combinatorics) to ensure comprehensive coverage of reasoning skills. Problems are categorized into two difficulty levels, easy and hard: the former challenges standard prompting approaches, while the latter tests advanced reasoning capabilities such as slow-thinking modes. The dataset follows the format of the MATH dataset, ensuring seamless integration with existing benchmarks and evaluation pipelines. Answers are restricted to real numbers and intervals; other formats such as set operations, variables, complex numbers, and free-form text are deliberately excluded to simplify evaluation. Furthermore, the benchmark specifically curates problems that yield challenging numerical answers and, when a problem has multiple solutions, asks for an aggregate of all possible outcomes, such as their sum or sum of squares.
When introducing a new benchmark, it is crucial to establish its uniqueness and value relative to existing ones. This involves highlighting the specific gaps it addresses, such as complex reasoning or multilingual evaluation. The benchmark’s design should minimize biases and ensure reliability through rigorous problem curation and validation, and its difficulty must be calibrated to challenge current models without being insurmountable. Its format and evaluation metrics should be clearly defined and easily reproducible, facilitating broader adoption and comparison, and a diverse set of problem types is needed to cover a variety of scenarios. Finally, the release of a new benchmark should be accompanied by an analysis of its performance against state-of-the-art models, demonstrating its effectiveness in differentiating model capabilities and guiding future research.
This figure presents a geometry problem from the OlymMATH dataset, described entirely in text. It illustrates the dataset’s focus on textual problem representation, even for subjects like geometry that traditionally rely on diagrams, and exemplifies the OlymMATH design principle that all problems must be amenable to evaluation via language models, enabling a fair and consistent assessment of LLMs’ reasoning capabilities.
Figure 3: A geometry problem described precisely in text from OlymMATH.
This table shows the distribution of the 200 problems in the OlymMATH benchmark across four mathematical categories (Algebra, Geometry, Number Theory, and Combinatorics) and two difficulty levels (Easy and Hard). It provides the counts of problems for each category and difficulty level combination, and the total number of problems in each category.
Table 2: The distribution of contest problems by category.
| Models | Pass@1 Alg. | Pass@1 Geo. | Pass@1 Num. | Pass@1 Com. | Pass@1 Avg. | Cons@10 Alg. | Cons@10 Geo. | Cons@10 Num. | Cons@10 Com. | Cons@10 Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| **The Hard Subset** | | | | | | | | | | |
| Qwen2.5-32B-R1D | 17.2 | 18.8 | 17.6 | 0.8 | 13.6 | 36.0 | 20.0 | 24.0 | 0.0 | 20.0 |
| QwQ-32B | 25.6 | 18.4 | 24.8 | 5.2 | 18.5 | 36.0 | 12.0 | 24.0 | 4.0 | 19.0 |
| DeepSeek-R1 | 32.8 | 28.8 | 19.2 | 4.0 | 21.2 | 48.0 | 36.0 | 24.0 | 4.0 | 28.0 |
| o3-mini (high) | 24.0 | 32.7 | 44.7 | 20.0 | 30.3 | / | / | / | / | / |
| **The Easy Subset** | | | | | | | | | | |
| Qwen2.5-32B-R1D | 73.6 | 69.1 | 88.5 | 48.3 | 66.7 | 92.0 | 97.0 | 92.3 | 58.6 | 84.0 |
| QwQ-32B | 84.8 | 79.7 | 83.8 | 67.9 | 78.1 | 96.0 | 100.0 | 92.3 | 79.3 | 92.0 |
| DeepSeek-R1 | 88.4 | 84.8 | 90.8 | 62.4 | 80.0 | 100.0 | 97.0 | 92.3 | 75.9 | 91.0 |
| o3-mini (high) | 90.0 | 89.4 | 100.0 | 85.1 | 89.7 | / | / | / | / | / |
This table details the acceptable and unacceptable answer formats for the OlymMATH benchmark. Acceptable formats are limited to single real numbers, intervals (ranges), and sets of numbers. Unacceptable formats include set operations (union, intersection, etc.), variables (symbols representing unknown values), complex numbers, and any text-based answers. This restriction ensures consistency and ease of automated evaluation.
Table 3: The included and excluded formats of the final answer.
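To illustrate how such format restrictions enable simple rule-based grading, below is a minimal sketch of an answer-format checker covering real numbers and intervals (sets of numbers are omitted for brevity). This is not the benchmark's actual evaluation code; the regular expressions and function name are assumptions.

```python
import re

# Matches plain real numbers (integers, decimals, simple fractions) and
# intervals such as "(0, 3.5]". Everything else is rejected.
_NUMBER = r"-?\d+(?:\.\d+)?(?:/\d+)?"
_INTERVAL = rf"[\[\(]\s*{_NUMBER}\s*,\s*{_NUMBER}\s*[\]\)]"

def is_acceptable_answer(text: str) -> bool:
    """Return True if the answer is a real number or an interval,
    the two formats this sketch treats as acceptable."""
    text = text.strip()
    return bool(re.fullmatch(_NUMBER, text) or re.fullmatch(_INTERVAL, text))

# Hypothetical examples:
print(is_acceptable_answer("59"))        # True  (real number)
print(is_acceptable_answer("(0, 3.5]"))  # True  (interval)
print(is_acceptable_answer("x + 1"))     # False (contains a variable)
print(is_acceptable_answer("A ∪ B"))     # False (set operation)
```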
| Models | Pass@1 Alg. | Pass@1 Geo. | Pass@1 Num. | Pass@1 Com. | Pass@1 Avg. | Cons@10 Alg. | Cons@10 Geo. | Cons@10 Num. | Cons@10 Com. | Cons@10 Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| **The Hard Subset** | | | | | | | | | | |
| Qwen2.5-32B-R1D | 5.6 | 6.8 | 8.0 | 1.2 | 5.4 | 4.0 | 8.0 | 12.0 | 0.0 | 6.0 |
| QwQ-32B | 19.6 | 12.4 | 15.2 | 1.2 | 12.1 | 16.0 | 12.0 | 24.0 | 0.0 | 13.0 |
| DeepSeek-R1 | 21.6 | 24.4 | 16.8 | 1.6 | 16.1 | 20.0 | 36.0 | 20.0 | 0.0 | 19.0 |
| o3-mini (high) | 22.0 | 26.0 | 48.7 | 12.0 | 27.2 | / | / | / | / | / |
| **The Easy Subset** | | | | | | | | | | |
| Qwen2.5-32B-R1D | 45.2 | 40.3 | 54.6 | 23.4 | 38.5 | 56.0 | 57.6 | 61.5 | 31.0 | 50.0 |
| QwQ-32B | 76.8 | 73.3 | 87.7 | 48.6 | 68.9 | 92.0 | 97.0 | 92.3 | 62.1 | 85.0 |
| DeepSeek-R1 | 79.2 | 74.8 | 93.1 | 50.7 | 71.3 | 92.0 | 90.9 | 100.0 | 65.5 | 85.0 |
| o3-mini (high) | 88.0 | 92.4 | 96.2 | 79.9 | 88.2 | / | / | / | / | / |
This figure displays a sample problem from the OlymMATH-HARD subset of the benchmark. The problem is designed to evaluate whether large language models (LLMs) can correctly identify and report all possible solutions to a complex mathematical problem, rather than just finding a single valid answer. Its complexity and the requirement to find all possible answers make it a rigorous test of a model’s true reasoning capabilities.
Figure 4: An example from OlymMATH-HARD designed to assess whether the model can identify all possible answers.
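As a purely hypothetical illustration of this design (the numbers below are invented, not taken from the benchmark), a problem with several solutions can be phrased so that a single aggregate value must be reported, which fails any model that finds only part of the solution set:

```python
# Hypothetical example: suppose the full solution set of a problem is {1, 3, 7}
# and the benchmark asks for the sum of squares of all solutions.
solutions = {1, 3, 7}
print(sum(x**2 for x in solutions))  # 59; reporting only 7**2 = 49 would be marked wrong
```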
| Benchmarks | o3-mini (high) | DeepSeek-R1 | QwQ-32B | Qwen2.5-32B-R1D |
|---|---|---|---|---|
| MATH-500 | / | 97.3* | / | 94.3* |
| AIME 2024 | 87.3* | 79.8* | 79.5* | 72.6* |
| AIME 2025 | 86.5* | 70.0* | 66.0* | 59.0* |
| HMMT 202502 | 67.5* | 41.7* | 47.5* | 33.3* |
| OlymMATH-EN-HARD | 30.3 | 21.2 | 18.5 | 13.6 |
| OlymMATH-EN-EASY | 89.7 | 80.0 | 78.1 | 66.7 |
| OlymMATH-ZH-HARD | 27.2 | 16.1 | 12.1 | 5.4 |
| OlymMATH-ZH-EASY | 88.2 | 71.3 | 68.9 | 38.5 |
This figure shows an example of the structured format used to present problems and answers in the OlymMATH dataset. It demonstrates the parallel presentation of problems and solutions in English and Chinese, illustrating the bilingual nature of the dataset. The example includes problem statements, answers, and subject categories, highlighting the consistency of the data structure across languages.
Figure 5: Example of the format used in OlymMATH.
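To make the parallel structure concrete, here is a hypothetical sketch of what a single bilingual OlymMATH record could look like; the field names and values are assumptions based on the description above, not the dataset's actual schema.

```python
# A hypothetical bilingual OlymMATH record (field names are assumptions).
example_record = {
    "problem_en": "Find the maximum area of a triangle whose side lengths satisfy ...",
    "problem_zh": "<the same problem stated in Chinese>",
    "answer": "59",          # a single verifiable numerical answer
    "subject": "Geometry",   # Algebra / Geometry / Number Theory / Combinatorics
    "difficulty": "hard",    # "easy" (AIME-level) or "hard"
}
```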
| Dataset | Avg. Output Length (Correct) | Avg. Output Length (Incorrect) | Avg. Output Length (Total) | Avg. Reasoning Length (Correct) | Avg. Reasoning Length (Incorrect) | Avg. Reasoning Length (Total) |
|---|---|---|---|---|---|---|
| AIME 2024 | 1591 | 1707 | 1617 | 19958 | 50836 | 26820 |
| AIME 2025 | 1547 | 1625 | 1575 | 23350 | 53006 | 33894 |
| OlymMATH-EN-HARD | 1608 | 1610 | 1610 | 39024 | 43850 | 42827 |
| OlymMATH-EN-EASY | 1588 | 1594 | 1589 | 29132 | 46008 | 32507 |
Table 4 presents the performance of various large language models (LLMs) on the English version of the OlymMATH benchmark, which contains problems from four mathematical areas: Algebra, Geometry, Number Theory, and Combinatorics. The table reports Pass@1 and Cons@10 scores for each model and category. Pass@1 is the average accuracy over the model’s sampled answers, while Cons@10 is the accuracy of the majority-consensus answer among 10 sampled responses. Due to instability of the o3-mini API, only three responses per question were sampled for that model, so its Cons@10 score is not reported.
Table 4: Model performance on the English version of OlymMATH. The abbreviations “Alg.”, “Geo.”, “Num.”, and “Com.” represent the four problem categories: Algebra, Geometry, Number Theory, and Combinatorics, respectively. Due to o3-mini’s API instability, we sampled only three responses per question and cannot report its Cons@10 performance.
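For reference, the following is a minimal sketch of how Pass@1 and Cons@10 can be computed from sampled answers to a single problem; the function names and tie-breaking behavior are assumptions, not the authors' code.

```python
from collections import Counter

def pass_at_1(answers, reference):
    """Average accuracy over all sampled answers for one problem."""
    return sum(a == reference for a in answers) / len(answers)

def cons_at_k(answers, reference, k=10):
    """Majority-vote accuracy: take the most frequent answer among the first k
    samples and compare it to the reference (ties broken by first occurrence)."""
    majority_answer, _ = Counter(answers[:k]).most_common(1)[0]
    return float(majority_answer == reference)

# Hypothetical usage with 10 sampled answers to one problem:
samples = ["59", "59", "58", "59", "60", "59", "59", "57", "59", "59"]
print(pass_at_1(samples, "59"))  # 0.7
print(cons_at_k(samples, "59"))  # 1.0
```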
Table 5 presents the performance of several large language models (LLMs) on the Chinese version of the OlymMATH benchmark, which consists of 200 Olympiad-level math problems categorized into four areas: Algebra, Geometry, Number Theory, and Combinatorics. The table shows each model’s Pass@1 and Cons@10 scores (defined as in Table 4) for each category and overall. Due to instability of the o3-mini API, only three responses were sampled per question for this model, so its Cons@10 performance is not reported.
Table 5: Model performance on the Chinese version of OlymMATH. The abbreviations “Alg.”, “Geo.”, “Num.”, and “Com.” represent the four problem categories: Algebra, Geometry, Number Theory, and Combinatorics, respectively. Due to o3-mini’s API instability, we sampled only three responses per question and cannot report its Cons@10 performance.
This table compares the performance of four large language models (LLMs) across five different mathematical reasoning benchmarks: MATH-500, AIME 2024, AIME 2025, HMMT 202502, and OlymMATH (both easy and hard subsets in English and Chinese). The Pass@1 metric indicates the percentage of problems correctly solved by the model in a single attempt. Results marked with an asterisk (*) were taken from the original publications or the MathArena website, not from the authors’ own experiments. The table highlights the relative difficulty of each benchmark and how the models’ performance varies across them.
Table 6: Model performance on different mathematical benchmarks (Pass@1). The results with (*) denote that they are copied from their reports or the third-party website MathArena.
This table presents a quantitative analysis of the DeepSeek-R1 model’s response length on four mathematical reasoning benchmarks: AIME 2024, AIME 2025, OlymMATH-EN-HARD, and OlymMATH-EN-EASY. For each benchmark, it reports the average length of the model’s final answer (Output Length) and the average length of its reasoning process (Reasoning Length), each further broken down by correct and incorrect answers, giving insight into the relationship between response length and solution accuracy. The data show how the model’s output and thought processes vary in length with the complexity and nature of the problems in each benchmark.
Table 7: The average length of the responses generated from DeepSeek-R1. “Output Length” refers to the length of the final solution provided by the model, while “Reasoning Length” indicates the length of the model’s thought process.
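As a rough sketch of how “Reasoning Length” and “Output Length” could be separated for a model such as DeepSeek-R1 that wraps its chain of thought in <think>...</think> tags; measuring length in characters here is an assumption (the paper may count tokens instead), and this is not the authors' analysis code.

```python
import re

def split_reasoning_and_output(response: str) -> tuple[int, int]:
    """Return (reasoning_length, output_length) in characters for a response
    whose chain of thought is wrapped in <think>...</think> tags."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1) if match else ""
    # Everything outside the think tags is treated as the final solution.
    output = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    return len(reasoning), len(output)

# Hypothetical usage:
resp = "<think>Try small cases, then generalize ...</think>The answer is 59."
print(split_reasoning_and_output(resp))  # (36, 17)
```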
This figure displays an example problem encountered during the OlymMATH data collection. The problem asks for the maximum area of a triangle given a constraint on its side lengths. The model, o3-mini (high), arrived at the correct answer, but its solution process lacked mathematical rigor. Due to space limitations, only a concise summary of the model’s solution steps is presented.
Figure 7: An example during our data collection. o3-mini (high) found the correct answer without rigorously solving the problem. Due to length constraints, we provide only a summary of its original solution steps.