TL;DR#
Large Language Models (LLMs) have shown impressive abilities but still struggle with rigorous mathematical reasoning. Determining whether a multivariate polynomial is nonnegative, a question central to polynomial optimization, is NP-hard in general and remains a challenge for LLMs. This paper introduces SoS-1K, a dataset of roughly 1,000 polynomials paired with expert-designed reasoning instructions based on five progressively challenging criteria. Evaluating state-of-the-art LLMs reveals that, without guidance, accuracy hovers only slightly above random guessing: models need explicit, structured help to perform this task.
The study demonstrates that providing high-quality reasoning instructions significantly boosts LLM accuracy, raising the best model to around 81%. A 7B model fine-tuned on the dataset, SoS-7B, outperforms much larger models such as the 671B DeepSeek-V3. These findings highlight the potential of LLMs to push the boundaries of mathematical reasoning and to tackle computationally hard (NP-hard) problems. Code is available in the project’s GitHub repository.
Key Takeaways#
Why does it matter?#
This paper pioneers the use of LLMs for complex mathematical problems such as determining polynomial nonnegativity. It highlights the potential of AI to advance mathematical research and make progress on NP-hard problems, and the meticulously curated SoS-1K dataset will serve as a valuable resource for future research.
Visual Insights#
🔼 This figure compares three different prompting methods used to guide large language models (LLMs) in determining whether a given polynomial can be expressed as a sum of squares (SoS). SoS Plain shows a simple question with no additional guidance. SoS Simple provides a structured classification based on simple criteria, while SoS Reasoning offers a detailed, step-by-step approach with progressively challenging criteria and expert-designed reasoning instructions. The figure highlights the increasing level of detail and guidance provided to the LLM, demonstrating how the structured prompts improve the accuracy and reasoning process.
Figure 1: Demonstration of SoS Plain (left), SoS Simple (mid), and SoS Reasoning (right).
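To make the three prompt styles concrete, below is a minimal sketch of how such prompts could be assembled. The wording, the hint in SoS Simple, and the `criteria` placeholders are illustrative assumptions, not the paper’s exact instructions.

```python
# Hypothetical sketch of the three prompt styles from Figure 1.
# The exact wording and the five expert-designed criteria in the paper differ;
# this only illustrates the increasing amount of structure handed to the LLM.

def sos_plain(poly: str) -> str:
    # Bare question, no guidance.
    return f"Is the following polynomial a sum of squares (SoS)?\n{poly}\nAnswer YES or NO."

def sos_simple(poly: str) -> str:
    # Adds a coarse classification hint based on simple necessary conditions.
    return (
        "Decide whether the polynomial below is a sum of squares.\n"
        "Hint: an SoS polynomial must have even degree and be nonnegative everywhere.\n"
        f"{poly}\nAnswer YES or NO."
    )

def sos_reasoning(poly: str, criteria: list[str]) -> str:
    # Step-by-step instructions over progressively harder criteria
    # (placeholders here, standing in for the paper's five criteria).
    steps = "\n".join(f"Step {i + 1}: {c}" for i, c in enumerate(criteria))
    return (
        "Work through the following checks in order and explain your reasoning "
        f"before giving a final YES/NO answer.\n{steps}\nPolynomial:\n{poly}"
    )
```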
| Model | Valid Acc. (SoS Plain) | Valid Acc. (SoS Simple) | Valid Acc. (SoS Reasoning) | Total Acc. (SoS Plain) | Total Acc. (SoS Simple) | Total Acc. (SoS Reasoning) | Time, s (SoS Plain) | Time, s (SoS Simple) | Time, s (SoS Reasoning) |
|---|---|---|---|---|---|---|---|---|---|
| **General-purpose LLMs** | | | | | | | | | |
| Qwen2.5-7B-Instruct | 55% | 61% | 76% | 52% | 59% | 62% | 22.4 | 31.2 | 68.5 |
| Qwen2.5-7B-Instruct-1M | 54% | 64% | 75% | 54% | 64% | 63% | 5.6 | 8.4 | 35.2 |
| Qwen2.5-14B-Instruct | 55% | 66% | 74% | 52% | 66% | 69% | 12.9 | 23.1 | 48.3 |
| Qwen2.5-14B-Instruct-1M | 56% | 60% | 74% | 56% | 59% | 67% | 12.7 | 20.7 | 52.7 |
| Qwen2.5-32B-Instruct | 56% | 58% | 67% | 55% | 58% | 62% | 13.0 | 18.0 | 37.4 |
| DeepSeek-V3 | 54% | 60% | 70% | 54% | 60% | 69% | 29.6 | 39.8 | 95.0 |
| GPT-4o-mini | 59% | 67% | 72% | 59% | 67% | 69% | 10.8 | 15.4 | 53.1 |
| GPT-4o | 60% | 61% | 75% | 59% | 61% | 75% | 14.6 | 16.2 | 27.8 |
| **Reasoning-purpose LLMs** | | | | | | | | | |
| QwQ-32B-Preview | 64% | 71% | 79% | 44% | 54% | 52% | 105.7 | 101.8 | 100.0 |
| DeepSeek-R1 | 62% | 62% | 81% | 55% | 55% | 56% | 514.5 | 565.6 | 492.5 |
| OpenAI o1-mini | 58% | 61% | 77% | 57% | 61% | 76% | 8.3 | 18.1 | 34.9 |
| **Average** | 57% | 63% | 75% | 54% | 60% | 65% | 68.2 | 78.0 | 95.0 |
🔼 This table presents a comparison of the accuracy of various state-of-the-art Large Language Models (LLMs) on a subset of the SoS-1K dataset. The SoS-1K dataset contains approximately 1000 polynomials, and the subset used here contains 340 randomly selected polynomials. The accuracy is evaluated using three different prompting methods (SoS Plain, SoS Simple, and SoS Reasoning), which provide progressively more structured guidance to the LLMs. The table shows the accuracy achieved by each LLM on both valid samples (where the LLM produced a response within the time limit) and total samples. The results reveal the performance difference between general-purpose and reasoning-focused LLMs, highlighting the impact of instruction quality on LLM accuracy for this type of complex mathematical reasoning problem.
Table 1: Accuracy comparison across various SOTA LLMs on a subset of SoS-1K with 340 samples. Results are divided into “Valid Samples” and “Total Samples”, as we found that LLMs sometimes suffer from timeout issues.
In-depth insights#
SoS1: O1-R1 LLMs#
The paper introduces SoS-1K, a novel dataset designed to evaluate the mathematical reasoning capabilities of Large Language Models (LLMs) on sum-of-squares (SoS) problems. It investigates whether LLMs can determine if a given multivariate polynomial is nonnegative, a computationally intractable problem with applications in many fields, and asks whether the promise of test-time scaling extends to research-level mathematics, an area that remains largely unexplored. To probe the capacity of models such as OpenAI o1 and DeepSeek-R1 to solve large-scale SoS programming problems, the authors carefully construct a dataset of approximately 1,000 polynomials accompanied by expert-designed, reasoning-guiding instructions based on five progressively challenging criteria, aimed at pushing the mathematical reasoning of LLMs.
NP-Hard SoS-1K#
SoS-1K targets the NP-hard problem of determining polynomial nonnegativity. Tackling NP-hard challenges directly positions LLMs at the forefront of mathematical problem-solving, with the implication that they could be applied to traditionally intractable optimization problems. The “1K” reflects a dataset scale intended to stress-test LLMs beyond toy examples and benchmark them on realistic problems, since real-world mathematical problems often have an NP-hard structure. The SoS framework offers a structured path through this complexity, guiding the LLMs with a mathematically principled approach and showing how they can handle a specific class of NP-hard problems, namely those related to sums of squares.
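For background, the standard reformulation that makes SoS membership computationally tractable (while certifying nonnegativity) can be stated as follows; this is textbook sum-of-squares material rather than notation taken from the paper.

```latex
% A polynomial p of degree 2d in x = (x_1, ..., x_n) is a sum of squares
% iff there is a positive semidefinite Gram matrix Q with p(x) = z(x)^T Q z(x),
% where z(x) stacks all monomials of degree at most d.
\[
  p(x) = \sum_{i} q_i(x)^2
  \;\Longleftrightarrow\;
  \exists\, Q \succeq 0 \ \text{with}\ p(x) = z(x)^{\top} Q\, z(x).
\]
% Finding such a Q is a semidefinite program (SDP), so SoS membership is
% efficiently checkable, whereas deciding p(x) >= 0 for all x is NP-hard
% in general (already for quartics). SoS is a sufficient, not necessary,
% condition for nonnegativity.
```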
LLMs for SoS?#
The paper investigates the potential of Large Language Models (LLMs) for tackling sum-of-squares (SoS) problems, a computationally challenging area of polynomial optimization closely related to Hilbert’s Seventeenth Problem. The research introduces SoS-1K, a curated dataset of approximately 1,000 polynomials designed to evaluate LLMs’ reasoning abilities. The study reveals that LLMs struggle with SoS problems without explicit, structured guidance, highlighting the importance of high-quality, expert-designed instructions. A key finding is that providing LLMs with detailed reasoning traces significantly improves accuracy, demonstrating their latent capacity for mathematical reasoning when effectively prompted. Furthermore, fine-tuning a smaller 7B model on the SoS-1K dataset yields better accuracy and computation time than much larger models such as DeepSeek-V3 and GPT-4o-mini, suggesting that LLMs can be applied efficiently to mathematical problem-solving.
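As a concrete illustration of the underlying computational task (not the paper’s evaluation pipeline), a Gram-matrix SDP check for a small two-variable quartic might look like the sketch below; the example polynomial, monomial basis, and use of CVXPY are assumptions made for illustration.

```python
import cvxpy as cp

# Illustrative SDP check (not the paper's code): is
# p(x, y) = x^4 + 2*x^2*y^2 + y^4 a sum of squares?
# Monomial basis z = [x^2, x*y, y^2]; p is SoS iff some PSD Q satisfies
# p(x, y) = z^T Q z.
Q = cp.Variable((3, 3), symmetric=True)

# Matching coefficients of z^T Q z against p:
#   x^4     : Q[0,0]            = 1
#   x^2 y^2 : 2*Q[0,2] + Q[1,1] = 2
#   y^4     : Q[2,2]            = 1
#   x^3 y   : 2*Q[0,1]          = 0
#   x y^3   : 2*Q[1,2]          = 0
constraints = [
    Q >> 0,                      # Q must be positive semidefinite
    Q[0, 0] == 1,
    2 * Q[0, 2] + Q[1, 1] == 2,
    Q[2, 2] == 1,
    Q[0, 1] == 0,
    Q[1, 2] == 0,
]
problem = cp.Problem(cp.Minimize(0), constraints)  # feasibility problem
problem.solve()

# A feasible Q is an SoS certificate; here p = (x^2 + y^2)^2, so one exists.
print("SoS certificate found:", problem.status == cp.OPTIMAL)
```

A feasible PSD matrix Q acts as an SoS certificate; the dataset’s harder instances are those where the LLM must carry out this kind of reasoning symbolically rather than by calling a numerical solver.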
Reasoning boost#
A reasoning boost in LLMs is pivotal for tackling complex mathematical tasks such as determining whether a multivariate polynomial is a sum of squares (SoS). The study shows that LLMs perform poorly without structured guidance, while high-quality reasoning instructions improve accuracy considerably. This suggests LLMs possess the underlying knowledge but need structured instructions to retrieve and apply it effectively. Reasoning-focused LLMs generally outperform general-purpose ones, underscoring the importance of reasoning capability. Higher-capacity models require fewer thinking tokens, whereas lower-capacity models need more reasoning steps to reach their best performance. Supervised fine-tuning further improves accuracy and reduces response times, highlighting the potential of LLMs to push the boundaries of mathematical reasoning and tackle NP-hard problems.
Future of AI SoS#
The “Future of AI SoS” is ripe with potential, especially given the advances highlighted in the paper. AI’s role in sum-of-squares (SoS) problem-solving could transform mathematical research by making NP-hard problems more tractable. Future research should focus on further enhancing LLMs’ reasoning capabilities, perhaps by incorporating more structured knowledge or developing specialized architectures. The trend of test-time scaling shows promise, suggesting that LLMs can generate more complex solutions with additional computational resources. Addressing current limitations, such as handling longer polynomials and verifying the validity of LLM decisions, is crucial. Datasets like SoS-1K serve as valuable benchmarks, but expanding them to cover even more challenging problems will be essential. The potential for AI not only to solve problems but also to generate new mathematical insights, as demonstrated by Qwen-14B-1M’s NNSoS example, signals a paradigm shift in mathematical exploration. However, ethical considerations regarding AI’s role in mathematical discovery need careful attention to ensure the credibility of machine-generated results.
More visual insights#
More on figures
🔼 This figure shows the accuracy of the o1-mini model on various subsets of the SoS-1K dataset, categorized by difficulty. It compares the model’s performance under three different prompting conditions: SoS Plain (a simple question), SoS Simple (with simple instructions), and SoS Reasoning (with detailed, step-by-step instructions). The x-axis represents the test sets and the y-axis the accuracy. The figure illustrates how the quality of instructions significantly affects the model’s ability to solve the SoS problem; each test set has a different degree of difficulty, with some focusing on specific properties of SoS polynomials.
Figure 2: Accuracy of different test sets using o1-mini.
🔼 This figure shows the relationship between the length of the LLMs’ responses (in thousands of tokens) and the number of correct answers they provide for SoS problems. It demonstrates how performance varies with the length of the reasoning process: some models achieve peak accuracy at a shorter response length, while others require longer reasoning chains to get more answers correct. This illustrates the effect of “thinking steps” on accuracy and reveals capability differences across models.
Figure 3: Number of correct samples with various response lengths.
More on tables
| Model | Size | Acc. (%) |
|---|---|---|
| **Closed Source** | | |
| GPT-4o | — | 75 |
| o1-mini | — | 76 |
| **Open Source** | | |
| Qwen2.5-7B-Instruct-1M | 7B | 63 |
| Qwen2.5-32B-Instruct | 32B | 62 |
| QwQ-32B-Preview | 32B | 52 |
| DeepSeek-V3 | 671B | 69 |
| DeepSeek-R1 | 671B | 56 |
| SoS-7B (Ours) | 7B | 70 |
🔼 This table presents the accuracy of various Large Language Models (LLMs) on the Sum of Squares (SoS) Reasoning benchmark. It compares the performance of both closed-source and open-source models of different sizes, indicating their accuracy in determining whether a given polynomial is an SoS. The accuracy is calculated based on the total number of evaluation samples and not just a subset, providing a comprehensive assessment of model capabilities.
Table 2: Accuracy Comparison on SoS Reasoning Benchmark, where '—' denotes the undisclosed model size. Accuracy is measured on full evaluation samples.
| Polynomial Type | Length < 4000 chars | Length > 4000 chars | Total | Is it SoS? | Difficulty |
|---|---|---|---|---|---|
| Test Set 1: Odd Degree Polynomial | 150 | 50 | 200 | NO | Easy |
| Test Set 2a: SoS (Expanded Form) | 69 | 51 | 120 | YES | Hard |
| Test Set 2b: Negative (Expanded Form) | 23 | 40 | 63 | NO | Hard |
| Test Set 2.1a: SoS (Squared Form) | 105 | 15 | 120 | YES | Easy |
| Test Set 2.1b: Negative (Squared Form) | 38 | 25 | 63 | NO | Easy |
| Test Set 3.1a: Nonnegative Quadratic Quartic | 100 | 0 | 100 | YES | Medium |
| Test Set 3.1b: Negative Quadratic Quartic | 100 | 0 | 100 | NO | Medium |
| Test Set 3.2a: Nonnegative Quartic with 2 Variables | 100 | 0 | 100 | YES | Medium |
| Test Set 3.2b: Negative Quartic | 100 | 0 | 100 | NO | Medium |
| Test Set 4a: Nonnegative Quadratic Quartic | 100 | 0 | 100 | YES | Medium |
| Test Set 4b: Negative Quartic | 100 | 0 | 100 | NO | Medium |
| Test Set 5.1a: PSD Q | 80 | 16 | 96 | YES | Hard |
| Test Set 5.1b: Non-PD Q | 80 | 16 | 96 | NO | Hard |
| Test Set 5.2a: PSD Sparse Q (Sparsity 0.1) | 56 | 16 | 72 | YES | Hard |
| Test Set 5.2b: Non-PD Sparse Q (Sparsity 0.1) | 56 | 16 | 72 | NO | Hard |
| Test Set 5.3a: PSD Low-Rank Q (Rank 3) | 42 | 18 | 60 | YES | Hard |
| Test Set 5.3b: Non-PD Low-Rank Q (Rank 3) | 28 | 12 | 40 | NO | Hard |
| Test Set 5.4a: PSD Ill-Conditioned Q | 20 | 15 | 35 | YES | Hard |
| Test Set 5.4b: Non-PD Ill-Conditioned Q | 40 | 30 | 70 | NO | Hard |
🔼 Table 3 provides a summary of the SoS-1K dataset, which consists of approximately 1000 polynomials categorized into different subsets based on characteristics like polynomial degree, presence of specific structures, and difficulty level. Each subset is labeled to indicate whether its polynomials are sums of squares (SoS) or not, and the number of polynomials in each subset. Additionally, the table shows the length distribution of polynomials (less than 4000 characters or more) within each subset and indicates the difficulty level of each subset.
Table 3: Summary of SoS-1K Test Sets.
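Table 3’s Test Set 5 variants are parameterized by properties of a matrix Q (positive semidefinite or not, sparse, low-rank, ill-conditioned). The sketch below shows one way such an instance could be generated; the monomial basis, dimensions, and random model are assumptions for illustration rather than the paper’s exact construction.

```python
import numpy as np
import sympy as sp

# Illustrative sketch of a Test-Set-5-style instance: a polynomial built as
# m(x)^T Q m(x) from a monomial vector m(x) and a matrix Q. If Q is PSD the
# polynomial is SoS by construction; an indefinite Q produces a candidate
# for the dataset's "Non-PD" cases.
rng = np.random.default_rng(0)
n_vars, n_monomials = 3, 4

x = sp.symbols(f"x1:{n_vars + 1}")          # (x1, x2, x3)
monomials = [sp.Integer(1), *x]             # simple degree-<=1 basis, 4 entries

A = rng.standard_normal((n_monomials, n_monomials))
Q_psd = A @ A.T                             # PSD by construction
Q_indef = Q_psd - 2 * np.eye(n_monomials)   # may have negative eigenvalues

def build_poly(Q: np.ndarray) -> sp.Expr:
    # Expand m(x)^T Q m(x) into a polynomial in x1, x2, x3.
    m = sp.Matrix(monomials)
    return sp.expand((m.T * sp.Matrix(Q.tolist()) * m)[0, 0])

print(build_poly(Q_psd))     # SoS (hence nonnegative) polynomial
print(build_poly(Q_indef))   # may take negative values
```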
| Model | # Total Samples | # Valid (SoS Plain) | # Valid (SoS Simple) | # Valid (SoS Reasoning) |
|---|---|---|---|---|
| DeepSeek-R1 | 340 | 300 | 302 | 234 |
| QwQ-32B | 340 | 233 | 259 | 225 |
| o1-mini | 340 | 338 | 340 | 337 |
| Qwen2.5-7B | 340 | 323 | 332 | 277 |
| GPT-4o | 340 | 336 | 338 | 337 |
| Qwen2.5-7B-1M | 340 | 340 | 339 | 286 |
| Qwen2.5-14B | 340 | 325 | 340 | 316 |
| Qwen2.5-14B-1M | 340 | 340 | 336 | 309 |
| GPT-4o-mini | 340 | 339 | 339 | 327 |
| DeepSeek-V3 | 340 | 340 | 340 | 332 |
| Qwen2.5-32B | 340 | 334 | 339 | 315 |
| Average | 340 | 323 | 328 | 300 |
🔼 This table presents a comparison of the performance of various Large Language Models (LLMs) on a subset of the SoS-1K dataset. For each model, it shows the total number of samples tested, and the number of valid samples (i.e., samples for which the model produced a response within the time limit). The table further breaks down the valid samples according to which prompting method was used (SoS Plain, SoS Simple, and SoS Reasoning). This allows for an assessment of how different prompting strategies impact the accuracy of each model’s predictions regarding the sum-of-squares property of the polynomials.
Table 4: # Total Samples and Valid Samples of Different Models.