TL;DR#
The integration of AI into Software Engineering (AI4SE) has led to numerous benchmarks for tasks like code generation. However, this surge presents challenges such as scattered benchmark knowledge, difficulty in selecting relevant benchmarks, absence of a uniform standard for benchmark development, and limitations of existing benchmarks. Addressing these issues is critical for accurate evaluation and comparison of AI models in software engineering.
This paper introduces BenchScout, a semantic search tool for finding relevant benchmarks, and BenchFrame, a unified method for enhancing benchmark quality. Applying BenchFrame to the HumanEval benchmark yields HumanEvalNext, which features corrected errors, improved language conversion, expanded test coverage, and increased difficulty. The paper then evaluates ten state-of-the-art code language models on HumanEvalNext.
Key Takeaways#
Why does it matter?#
This paper matters to researchers because it addresses critical gaps in AI4SE benchmark quality and accessibility. The BenchScout and BenchFrame tools offer an approach to streamline benchmark selection and enhance existing benchmarks, enabling more reliable and relevant evaluations of AI models.
Visual Insights#
| Category | Name | Language(s) | # Tests |
| --- | --- | --- | --- |
| Original | HumanEval [HumanEval-2021] | Python | Avg. 7.7 |
| Improved Language Support | MultiPL-HumanEval [MultiPL-E-2022] | 18 programming languages | Avg. 7.7 |
| | HumanEval-Fix [HumanEvalPack-2023] | 6 programming languages | Avg. 7.7 |
| | HumanEval-Explain [HumanEvalPack-2023] | 6 programming languages | Avg. 7.7 |
| | HumanEval-Synthesize [HumanEvalPack-2023] | 6 programming languages | Avg. 7.7 |
| | HumanEval-X [HumanEval-X-2023] | 5 programming languages | Avg. 7.7 |
| | Multi-HumanEval [MultiEval-HumanEval-MBXP-MathQA-X-2022] | 12 programming languages | Avg. 7.7 |
| | HumanEvalXL [peng_humaneval-xl_2024] | 12 PLs, 23 NLs | Avg. 8.33 |
| Improved Testing | HumanEval+ [HumanEvalPlus-Mini-MBPP+-2023] | Python | Scaled ×80 |
| | HumanEval-MINI [HumanEvalPlus-Mini-MBPP+-2023] | Python | Scaled ×47 |
| | HE-Eval [CodeScore-HE-APPS-MBPP-Eval-2023] | Python | Scaled ×14 |
| Instruction-based | InstructHumanEval (https://huggingface.co/datasets/codeparrot/instructhumaneval) | Python | Avg. 7.7 |
| Extended | EvoEval [EvoEval-2024] | Python | Multiple categories, scaled with EvalPlus |
🔼 This table presents an overview of various AI4SE (Artificial Intelligence in Software Engineering) benchmarks that originated from the HumanEval benchmark. It categorizes these benchmarks based on modifications made to the original HumanEval, such as improvements to language support, testing methodologies, and overall expansion of the benchmark. For each benchmark, the table lists its name, the programming languages it supports, and the number of tests or problems it contains.
TABLE I: Overview of AI4SE benchmarks stemming from HumanEval [HumanEval-2021].
In-depth insights#
AI4SE Review#
The review shows that AI4SE benchmarks are growing rapidly in number, reflecting the increasing integration of AI into software engineering. It identifies key limitations, including scattered benchmark knowledge, difficulty in selection, absence of uniform standards, and inherent flaws. The review process involved a systematic search, credibility verification, and taxonomy development to categorize benchmarks, extract metadata, and address the challenges in evaluating AI models for code generation, repair, and understanding. BenchFrame introduces a unified approach to benchmark enhancement, demonstrated through the HumanEvalNext case study, which addresses issues such as incorrect tests and insufficient test coverage. The framework offers a guide for improving benchmark methodology and exposes limitations in existing AI4SE benchmarks, paving the way for better evaluation and advancement of AI in software engineering practice.
BenchScout Tool#
BenchScout is a tool designed to address the challenge of locating relevant AI4SE benchmarks. Given the abundance of these benchmarks, finding the most suitable one for a specific software engineering task can be difficult. BenchScout systematically and semantically searches existing benchmarks and their corresponding use cases, and visually conveys how close and similar groups of benchmarks are. It analyzes relations between citing papers to identify patterns relevant to different use cases. The tool maps the unstructured textual content of papers to a semi-structured domain using pre-trained text embedding models, applies dimensionality reduction to create an easily interpretable 2D representation, and uses clustering techniques to assess similarity, as sketched below. An interactive interface lets users explore the resulting clusters, with features such as text-based search and a paper-content tooltip. A user study indicates high usability.
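To make that pipeline concrete, the minimal sketch below embeds paper text, projects it to 2D, and clusters it. The specific libraries and model (sentence-transformers with all-MiniLM-L6-v2, scikit-learn's t-SNE and k-means) and the example abstracts are illustrative assumptions, not necessarily the components BenchScout uses.

```python
# A minimal sketch of a BenchScout-style pipeline, assuming sentence-transformers
# and scikit-learn; the model name and example abstracts are illustrative.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

papers = [
    "HumanEval: evaluating code generation from docstrings",            # hypothetical abstracts
    "MBPP: mostly basic Python problems for program synthesis",
    "SWE-bench: resolving real GitHub issues with language models",
    "Defects4J: a database of real faults for automated program repair",
]

# 1) Map unstructured paper text into a semi-structured embedding space.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(papers)                        # shape: (n_papers, 384)

# 2) Reduce to 2D so benchmark groups can be inspected visually.
coords_2d = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embeddings)

# 3) Cluster to surface groups of related benchmarks / use cases.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

for text, (x, y), label in zip(papers, coords_2d, labels):
    print(f"cluster {label} @ ({x:.1f}, {y:.1f}): {text}")
```

In an interactive setting, the 2D coordinates and cluster labels would feed a scatter plot with tooltips, which is the kind of view the paper describes for exploring benchmark groups.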
BenchFrame Quality#
BenchFrame aims to improve benchmark quality for AI in Software Engineering (AI4SE). It likely addresses crucial aspects like correcting errors, improving language conversion, and expanding test coverage. The lack of standardization in benchmark development can lead to inconsistent evaluations and hinder progress. BenchFrame probably provides a structured methodology for refining benchmarks, ensuring they are robust and reliable. This is essential for accurately assessing model performance, preventing data leakage, and promoting fair comparisons across different approaches. By focusing on the practical aspects of enhancing benchmark quality, BenchFrame likely serves as a valuable tool for researchers and practitioners in the AI4SE field.
HumanEvalNext#
HumanEvalNext is presented as an enhanced version of the original HumanEval benchmark, addressing limitations such as incorrect tests and suboptimal solutions. Modifications include fixing canonical solutions, adding type annotations for better language-conversion support, and incorporating challenging scenarios (negative values, edge cases) to curb overfitting. Assertions are added within the code so that models cannot ignore crucial details, test examples are refined, and spelling errors are corrected; a before/after sketch of these kinds of edits follows below. An independent peer review confirmed the robustness of the enhancements and further refined the benchmark's quality.
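The hypothetical problem below illustrates the kinds of edits described above: type annotations, an in-code assertion for the stated precondition, and tests that add negative values and edge cases. The function and tests are invented for illustration and are not taken from HumanEval or HumanEvalNext.

```python
# Hypothetical before/after of a HumanEval-style problem (illustrative only;
# not an actual HumanEvalNext task).

# Before: untyped signature, no input validation, happy-path tests only.
def add_positive(a, b):
    """Return the sum of two positive integers."""
    return a + b

# After: type annotations ease conversion to other languages, and an assertion
# forces generated solutions to respect the stated precondition.
def add_positive_next(a: int, b: int) -> int:
    """Return the sum of two positive integers."""
    assert a > 0 and b > 0, "inputs must be positive"
    return a + b

def check(candidate) -> None:
    # Original-style happy-path test.
    assert candidate(2, 3) == 5
    # Edge case added to raise difficulty and curb overfitting.
    assert candidate(1, 1) == 2
    # Negative input must be rejected by the in-code assertion.
    rejected = False
    try:
        candidate(-1, 2)
    except AssertionError:
        rejected = True
    assert rejected, "negative input should trigger the precondition assertion"

check(add_positive_next)
```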
Bench AI Limit#
The concept of ‘Bench AI Limit’ likely refers to the inherent constraints and shortcomings of using AI-driven benchmarks in fields like software engineering (AI4SE). This includes limitations in scope, where benchmarks may not fully capture the complexity of real-world tasks, leading to an overestimation of AI capabilities. Another aspect is data contamination, where training datasets inadvertently include benchmark data, artificially inflating performance scores and hindering accurate evaluation. Benchmark saturation is also a concern, as models become increasingly adept at solving existing benchmarks, necessitating continuous development of more challenging and diverse benchmarks to truly reflect AI progress. The absence of standardized benchmark development practices is another factor. Addressing these limits is essential for ensuring benchmarks effectively guide innovation and provide reliable assessments of AI systems.
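As one concrete illustration of the data-contamination concern, a common heuristic is to measure token n-gram overlap between benchmark items and candidate training text. The sketch below is a generic example of that heuristic; the tokenization, n-gram size, and sample strings are assumptions for illustration, not a method from the paper.

```python
# Generic n-gram overlap heuristic for flagging possible benchmark contamination.
from typing import Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Whitespace-tokenize and collect all n-grams of the text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}

def contamination_score(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)

# Hypothetical usage: scores near 1.0 suggest the item may have leaked into training data.
score = contamination_score(
    "def has_close_elements(numbers, threshold): ...",                       # benchmark prompt (illustrative)
    "a crawled file containing def has_close_elements(numbers, threshold): ...",
    n=4,
)
print(f"overlap: {score:.2f}")
```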
More visual insights#
More on tables
| Category | Name | Language(s) | # Problems |
| --- | --- | --- | --- |
| Original | MBPP [MBPP-MathQA-2021] | Python | 974 |
| Improved Language Support | MultiPL-MBPP [MultiPL-E-2022] | 18 programming languages | 354-397 per language |
| | MBXP [MultiEval-HumanEval-MBXP-MathQA-X-2022] | 13 programming languages | 848-974 per language |
| Improved Testing | MBPP+ [HumanEvalPlus-Mini-MBPP+-2023] | Python | 427 |
| | MBPP-Eval [CodeScore-HE-APPS-MBPP-Eval-2023] | Python | 974 |
🔼 This table presents a comprehensive overview of AI4SE (Artificial Intelligence in Software Engineering) benchmarks derived from the MBPP (Mostly Basic Python Problems) benchmark. It categorizes these benchmarks based on their origin (original, improved language support, or improved testing) and provides the name of each benchmark along with the programming language(s) it supports and the number of problems included.
TABLE II: Overview of AI4SE benchmarks derived from MBPP [MBPP-MathQA-2021].
| Category | Name | Language(s) | # Tests |
| --- | --- | --- | --- |
| Competitive Programming | CodeContests [CodeContests-AlphaCode-2022] | 12 programming languages | Avg. 203.7 |
| | APPS [APPS-2021] | Python | Avg. 13.2 |
| | LiveCodeBench [jain_livecodebench_2024] | Python | Avg. 17.23 |
| | LeetCode [tian_is_2023] | Python | Avg. 135 |
| | CodeElo [quan_codeelo_2025] | N/A | 408 problems |
| Code Complexity | CoRCoD [CoRCoD-2019] | Java | 932 |
| | GeeksForGeeks (GFG) [GFG-2023] | C++, Python | ±1,400 per language & category |
| | CODAIT [CODAIT-2021] (https://ibm.co/4emPBIa) | Python | 4,000,000 |
| | CodeComplex [CodeComplex-2022] | Java, Python | 4,900 per language |
| | PythonSaga [PythonSaga-HumanEval-MBPP-Evaluation-Difficulty-2024] | Python | 185 |
| Code Efficiency | EffiBench [EffiBench-2024] | Python | Self-defined, avg. 100 |
| | CODAL [m_weyssow_codeultrafeedback_2024] | Python | 3 ref. / problem |
| | PIE [shypula_learning_2024] | C++ | 82.5 (median, train) |
🔼 This table presents a collection of benchmarks categorized into three groups based on their focus: competitive programming, code complexity, and code efficiency. Each benchmark includes the name, primary programming language(s) used, and the number of tests or problems included. This allows readers to quickly compare the characteristics of different benchmarking tools and select the most appropriate one for evaluating specific aspects of code quality and performance.
TABLE III: Overview of competitive programming, code complexity, and code efficiency benchmarks.
| Name | Language(s) | # Tests | Comment |
| --- | --- | --- | --- |
| DS-1000 [DS-1000-2022] | Python | Avg. 1.6 | 7 DS/ML libraries |
| NumpyEval [NumpyEval-PandasEval-PyCodeGPT-2022] | Python | Avg. 20 functions (avg. 1 variable) | NumPy (101 problems) |
| PandasEval [NumpyEval-PandasEval-PyCodeGPT-2022] | Python | Avg. 20 functions (avg. 1 variable) | Pandas (101 problems) |
| JuICe [JuICe-2019] | Python, JN | N/A | Cell completion (1.5M/3.7K train/test) |
| DSP [DSP-2022] | Python, JN | Available | Cell completion (1,119 problems) |
| ExeDS [huang_junjie_execution-based_2022] | Python, JN | Execution-based | Cell generation (ground truth), 534 tasks |
| DSEval [zhang_benchmarking_2024] | Python | Custom approach | Models evaluated via the DSEval approach from the paper |
| Bio-Coder [xiangru_tang_biocoder_2024] | Python | 1,026 | Identify and import necessary classes for a given task |
| | Java | 1,243 | |
| Bio-Coder-Rosalind [xiangru_tang_biocoder_2024] | Python | 253 golden solutions | Generate code for a question |
| WebApp1k [cui2024webapp1kpracticalcodegenerationbenchmark] | React | Available | Evaluates whether a model can generate a React web app |
🔼 This table presents a compilation of benchmarks specifically designed for evaluating the performance of AI models in data science tasks and other domain-specific software engineering problems. It details each benchmark’s name, the programming languages involved, the number of tests or problems included, and additional comments to clarify specific aspects or limitations. The ‘JN’ notation indicates the use of Jupyter Notebooks within the benchmark.
TABLE IV: Overview of data science & domain-specific benchmarks. (JN refers to Jupyter Notebooks.)
| Name | Language(s) | # Problems |
| --- | --- | --- |
| MATH [dan_hendrycks_measuring_2021] | English | 12,500 |
| MATH500 [lightman_lets_2023] | English | 500 |
| MathQA [MathQA-2019] | English | 37,297 |
| MathQA-Python [MBPP-MathQA-2021] | Python | 23,914 |
| MathQA-X [MultiEval-HumanEval-MBXP-MathQA-X-2022] | Python, Java, JS | 1,883 per language |
| Līla [Lila-BHASKARA-2022] | Python | 133,815 questions, 358,769 programs |
| MultiArith [MultiArith-2015] | English | 600 |
| GSM8K [PAL-GSM-Math-2022] | English | 1,320 |
| GSM-HARD [PAL-GSM-Math-2022] | English | 1,320 |
| TheoremQA [TheoremQA-2023] | English | 800 |
| PECC [patrick_haller_pecc_2024] | Python | 1,006 |
| BRIGHT [hongjin_su_bright_2024] | English | 395 |
| AMC12 (https://huggingface.co/datasets/AI-MO/aimo-validation-amc) | English | 82 |
🔼 This table presents a list of various benchmarks used for evaluating the mathematical reasoning capabilities of AI models in the context of software engineering. For each benchmark, it lists the name, the languages it supports, the number of problems it contains, and additional notes or specifics. This provides a comprehensive overview of available resources for assessing AI’s mathematical reasoning skills within AI4SE.
TABLE V: Overview of Mathematical Reasoning Benchmarks.
| Category | Name | Language(s) | No. of Problems |
| --- | --- | --- | --- |
| Text2Code | CoNaLa [CoNaLa-2018] | English → Python | 2,879 |
| | MCoNaLa [MCoNaLa-2022] | {Spanish, Japanese, Russian} → Python | 896 |
| | CoNaLa-SO [orlanski_reading_2021] | English → Python | 10,000 |
| | APPS [APPS-2021] | English → Python | 10,000 |
| | APPS-Eval [CodeScore-HE-APPS-MBPP-Eval-2023] | English → Python | 10,000 |
| | AixBench [AixBench-2022] | English, Chinese → Java | 175 |
| | Natural2Code [Gemini-Natural2Code-2023] | English → Python | Unknown |
| | CoSQA [CoSQA-2021] | English → Python | 20,604 |
| | WebQueryTest [CodeXGLUE-2021] | English → Python | 1,046 |
| | AdvTest [CodeXGLUE-2021] | English → Python | 280,634 |
| | CONCODE [Concode-2018] | English → Java | 104,000 |
| | MTPB [MTPB-CodeGen-2022] | English → Python | 115 |
| | CAASD [simiao_zhang_experimenting_2024] | English → Python | 72 |
| | Shellcode_IA32 [liguori_can_2022] | English → IA32/Shell | 3,200 |
| | Odex [liguori_can_2022] | {Spanish, Japanese, Russian, English} → Python | 945 {90, 164, 252, 439}, 1,707 tests total |
| | PSB2 [thomas_helmuth_psb2_2021] | English → {Clojure, Python} | 25 question-answer pairs |
| | TACO [rongao_li_taco_2023] | English → Python | 1,539,152 on 26,433 distinct tasks |
| | Turbulence [shahin_honarvar_turbulence_2023] | English → Python | 60 (with 420 total test cases) |
| | Aider (https://github.com/Aider-AI/aider/blob/main/benchmark/README.md) | English → {C++, GO, Java, JS, Python, Rust} | 225 problems |
| | NL2ML-lib [shin_good_2024] | English → Python (ML libraries) | 11,000 |
| | RMCBench [chen_rmcbench_2024] | English → 9 languages | 473 malicious prompts |
| | Evil [liguori_evil_2021] | English → {Python, IA_32} | 19,255 |
| Text2Text (about code) | InfiCoder-Eval [InfiCoder-Eval-2023] | English → English | 270 |
| | BRIGHT [hongjin_su_bright_2024] | English → English | 1,398 |
| Code2Text | DeepCom [DeepCom-2018] | Java → English | 588K |
| | Hybrid-DeepCom [hu_deep_2020] | Java → English | 466k |
| | BinSum [BinSum-2023] | Binary functions → English | 557K |
| | Code Attention [miltiadis_allamanis_convolutional_2016] | Java → English | 11 projects |
| | Funcom [alexander_leclair_neural_2019] | Java → English | 2.1M problems |
| | CodeSum [hu_summarizing_nodate] | Java → English | 410,630 |
| | CoDesc [hasan_codesc_2021] | Java → English | 4.21M datapoints |
| | Parallel [barone_parallel_nodate] | Python → English | 150k function/doc pairs |
| | CoDocBench [pai2025codocbenchdatasetcodedocumentationalignment] | Python → English | 4,573 code/doc pairs |
🔼 This table presents a comprehensive overview of benchmarks used for evaluating natural language processing (NLP) capabilities within the context of software engineering. It categorizes benchmarks by task type (Text2Code, Text2Text, Code2Text), listing the benchmark name, supported languages, and the number of problems or samples included in each benchmark. This detailed breakdown helps researchers select appropriate benchmarks based on their specific needs and research focus.
TABLE VI: Overview of Natural Language Benchmarks.
| Category | Name | Language(s) | No. of Problems |
| --- | --- | --- | --- |
| Text2Code | BIRD [jinyang_li_can_2023] | English → SQL | 12,751 |
| | KaggleDBQA [chia-hsuan_lee_kaggledbqa_2021] | English → SQL | 272, paired with golden solutions |
| | StacQc [zhiliang_yao_staqc_2018] | English → {Python/SQL} | {147,546 / 119,519} question-answer pairs |
| | Spider (V2; see Table XII) [lei2024spider20evaluatinglanguage] | English → SQL | 632 queries |
| | Spider-Syn [gan_towards_2021] | English → SQL | (7,000 / 1,034) |
| | Spider-Real [deng_structure-grounded_2021] | English → SQL | 508 |
| | Spider-DK [gan_exploring_2021] | English → SQL | 535 pairs |
| | Spider-CN [min_pilot_2019] | Chinese → SQL | 9,691 queries |
| | SParC [yu_sparc_2019] | English → SQL | 4,298 question sequences |
| | Lyra [liang_lyra_2022] | {English, Chinese} → {Python, SQL} | 2,000 |
| | DuSQL [wang_dusql_2020] | Chinese → SQL | 23,797 question/SQL pairs |
| | CoSQL [yu_cosql_2019] | English → SQL | 3,007 question sequences |
🔼 This table presents a curated list of benchmarks specifically designed for evaluating the performance of AI models on SQL-related tasks. It includes benchmarks categorized by the type of task (such as text-to-SQL code generation) and provides details on the programming language(s) involved and the number of problems or queries in each benchmark.
TABLE VII: Overview of SQL-related Benchmarks.
| Category | Name | Language(s) | No. of Samples |
| --- | --- | --- | --- |
| Programming Languages | CodeTrans [CodeXGLUE-2021] | C#, Java | 11,800 |
| | TransCoder-ST [TransCoderST-2022] | C++, Java, Python | 437,030 |
| | CoST [CoST-2022] | 7 programming languages | 16,738 |
| | AVATAR [AVATAR-2023] | Java, Python | 7,133 / 476 / 1,906 |
| | Multilingual-Trans [CodeTransOcean-MultilingualTrans-NicheTrans-LLMTrans-DLTrans-2023] | 8 programming languages | 30,419 total |
| | NicheTrans [CodeTransOcean-MultilingualTrans-NicheTrans-LLMTrans-DLTrans-2023] | Various niche languages | 236,468 total |
| | LLMTrans [CodeTransOcean-MultilingualTrans-NicheTrans-LLMTrans-DLTrans-2023] | 8 programming languages | 350 |
| | G-TransEva [jiao_evaluation_2023] | 5 programming languages | 400 total |
| | CODEDITOR [zhang_multilingual_2023] | C# & Java | 6,613 |
| Libraries | DLTrans [CodeTransOcean-MultilingualTrans-NicheTrans-LLMTrans-DLTrans-2023] | PyTorch, TensorFlow, MXNet, Paddle | 408 total |
| Intermediate Representation | SLTrans [paul_ircoder_2024] | 14 languages → LLVM-IR | 4M |
| Language Conversion Frameworks | MultiPL-E [MultiPL-E-2022] | 19 programming languages | - |
| | MultiEval [MultiEval-HumanEval-MBXP-MathQA-X-2022] | 13 programming languages | - |
🔼 This table presents a comprehensive overview of existing benchmarks for evaluating programming language translation models. It categorizes benchmarks by the type of programming languages involved (e.g., specific languages or multiple languages), intermediate representations used, and associated frameworks. The ‘No. of Samples’ column indicates the number of training, development, and test samples available in each benchmark dataset, clearly highlighting the scale and scope of each benchmark. This detailed breakdown helps researchers choose the most appropriate benchmark based on the languages and methods they are working with.
read the caption
TABLE VIII: Overview of Programming Language Translation Benchmarks (Note: X/Y/Z denotes Train/Dev/Test).
| Category | Benchmark | Language(s) | No. of Problems |
| --- | --- | --- | --- |
| Software Development & Agent Benchmarks | DevBench [devbench-2024] | Python, C/C++, Java, JavaScript | 22 repositories |
| | DevEval [DevEval-2024] | Python | 1,874 |
| | CoderUJB [CoderUJB-2024] | Java | 2,239 |
| | CODAL [m_weyssow_codeultrafeedback_2024] | Python | 500 |
| | ToolQA [zhuang_toolqa_2023] | Python, Math, English | 800 (Easy) / 730 (Hard) |
| | MINT [wang_mint_2024] | Python, English | 586 problems |
| | SAFIM [gong_evaluation_2024] | Python, Java, C++, C# | 17,720 |
| | AgentBench [liu_agentbench_2023] | N/A | 1,360 prompts |
| Class Level | ClassEval [ClassEval-2023] | Python | 100 |
| | CONCODE [Concode-2018] | English, Java | 104,000 |
| | BigCodeBench [BigCodeBench-2024] | Python | 1,140 |
| Project & Cross-file | SWE-bench [SWE-bench-2023] | Python | 19,008 (Train), 225 (Dev), 2,294 (Test), 144 (Small) |
| | CrossCodeEval [CrossCodeEval-2023] | C#, TypeScript, Java, Python | 2,665 (Python), 2,139 (Java), 3,356 (TypeScript), 1,768 (C#) |
| | CoderEval [CoderEval-2023] | Java, Python | 230 |
| | DotPrompts [agrawal_guiding_2023] | Java | 105,538 problems (1,420 methods) |
| | BigCloneBench [svajlenko_evaluating_2015] | Java | 25,000 Java systems |
| | DI-Bench [zhang2025dibenchbenchmarkinglargelanguage] | Python, C#, Rust, JS | 581 repositories (w/ dependencies) |
| | DyPyBench [Bouzenia_2024] | Python | 50 repositories |
| Repository Level | RepoBench [RepoBench-2023] | Python, Java | Cross-file: 8,033, In-file: 7,910 |
| | RepoEval [RepoEval-2023] | Python | 1,600 (line), 1,600 (API), 373 (function) |
| | EvoCodeBench [EvoCodeBench-2024] | Python | 275 |
| | SketchEval [daoguang_zan_codes_2024] | Python | 19 repositories (5 easy, 8 medium, 6 hard) |
| | Stack Repo [daoguang_zan_codes_2024] | Python | 435,890 / 220,615 / 159,822 answer pairs |
| | ML-BENCH [tang_ml-bench_2024] | Python & Bash | 9,641 problems |
| | CodeGen4Libs [liu_codegen4libs_2023] | Java | 403,780 prompts |
🔼 This table presents a collection of real-world software engineering benchmarks. It categorizes benchmarks by their focus area (Software Development & Agent Benchmarks, Class Level, Project & Cross-file, and Repository Level), lists the programming languages used in each benchmark, and indicates the number of problems or samples available for each benchmark. The notation X/Y/Z represents the number of training, development, and testing samples, respectively. This table is designed to provide a comprehensive overview of benchmarks used in evaluating AI models on practical software engineering tasks, offering a variety of complexities and scopes.
TABLE IX: Overview of Selected Real-to-Life SE Benchmarks. (Note: X/Y/Z denotes Train/Dev/Test)
| Category | Benchmark | Sources/API(s) | No. of Problems |
| --- | --- | --- | --- |
| API Prediction | RestBench [RestBench-RestGPT-2023] | Spotify, TMDB | 57, 100 |
| | APIBench-Q [APIBENCH-Q-2021] | StackOverflow, Tutorial Websites | 6,563 (Java), 4,309 (Python) |
| | BIKER [BIKER-Dataset-2018] | StackOverflow | 33,000 |
| | Gorilla APIBench [Gorilla-APIBench-APIZoo-2023] | HuggingFace, TensorHub, TorchHub | 925, 696, 94 |
| | Gorilla APIZoo [Gorilla-APIBench-APIZoo-2023] | Open submissions (Google, YouTube, Zoom, etc.) | – |
| Retrieval & Planning | API-Bank [API-Bank-2023] | 73 commonly used APIs | 753 |
| | CodeRAG-Bench [CodeRAG-Bench-2024] | Competition solutions, tutorials, documentation, StackOverflow, GitHub | 25,859 |
| | Search4Code [rao_search4code_2021] | Bing | 6,596 (Java) / 4,974 (C#) |
| | CoIR [li_coir_2024] | GitHub, StackOverflow, and various benchmarks | 2.38M (corpus), 3.37 (queries) |
| Memorization | SATML-ext [al-kaswan_traces_2024] | GitHub | 1,000 samples |
🔼 This table presents a categorized overview of selected AI4SE (Artificial Intelligence in Software Engineering) benchmarks focusing on API prediction, retrieval and planning, and memorization tasks. For each benchmark, it lists the name, the data sources or APIs utilized, and the number of problems or samples included. This provides a concise summary of resources available for evaluating AI models in these specific SE sub-domains.
TABLE X: Overview of Selected API and Retrieval Benchmarks by Category.
| Category | Benchmark | Language(s) | No. of Problems | Crowdsourced |
| --- | --- | --- | --- | --- |
| Pseudocode to Code | SPoC [SPoC-2019] | C++ | 18,356 | Yes |
| | NAPS [NAPS-2018] | Java/UAST | 17,477 | No |
| Code to Pseudocode | Django [Django-2015] | Python, English & Japanese | 18,805 (Train), 1,000 (Dev), 1,805 (Test) | No |
🔼 This table provides an overview of the AI4SE (Artificial Intelligence for Software Engineering) benchmarks that specifically focus on pseudocode. It details the benchmark’s name, the programming languages involved, the number of problems or samples within each benchmark, and whether the benchmark is crowdsourced.
TABLE XI: Overview of AI4SE Benchmarks Related to Pseudocode.
| Benchmark | Language(s) | No. of Problems | Source |
| --- | --- | --- | --- |
| WikiSQL [WikiSQL-2017] | Natural language → SQL query | 80,654 | Amazon MTurk (deprecated - 2017) |
| Spider [Spider-2018] | Natural language → SQL query | 10,181 | 11 Yale students (2018) |
| NL2Bash [xi_victoria_lin_nl2bash_2018] | Natural language → Bash | 9,305 | Upwork (2018) |
| NAPS [NAPS-2018] | Java/UAST, Pseudocode | 17,477 | Self-hosted crowdsourcing, competitive programming community (2018) |
| SPoC [SPoC-2019] | C++ | 18,356 | Competitive programming websites (2019) |
| MBPP [MBPP-MathQA-2021] | Python | 974 | Google Research, internal crowdworkers (2021) |
🔼 This table presents a collection of AI4SE (Artificial Intelligence for Software Engineering) benchmarks that were developed through crowdsourcing. It details the programming languages used, the number of problems included, and the original source of the benchmark data. Crowdsourced benchmarks are datasets created through community contributions and often represent real-world scenarios or problems.
TABLE XII: Overview of Selected Crowd-sourced Benchmarks.
| Category | Benchmark | Language(s) | No. of Samples |
| --- | --- | --- | --- |
| Automated Program Repair & Fault Localization | Defects4J [Defects4J-2014] | Java | 835 |
| | GitBug-Java [GitBug-Java-2024] | Java | 199 |
| | EvalGPTFix [quanjun_zhang_critical_2023] | Java | 4,530 |
| | TutorCode [boyang_yang_cref_2024] | C++ | 1,239 |
| | GHRB [jae_yong_lee_github_2023] | Java | 107 |
| | IntroClass [claire_le_goues_manybugs_2015] | C | 998 |
| | ManyBugs [claire_le_goues_manybugs_2015] | 7 Languages | 185 |
| | DebugBench [runchu_tian_debugbench_2024] | C++, Java, Python | 1,438 & 1,401 & 1,414 |
| | QuixBugs [derrick_lin_quixbugs_2017] | Java | 40 (locations of bugs) |
| | RES-Q [beck_labash_res-q_2024] | Python, JS | 100 hand-crafted questions + test scripts |
| | StudentEval [hannah_mclean_babe_studenteval_2023] | Python | 1,749 buggy programs on 48 distinct tasks (3 test cases per problem) |
| | Re-Factory [Hu_refactory_2019] | Python | 1,783 (buggy) / 2,442 (correct) |
| | ConDefects [wu_condefects_2023] | Python, Java | 526 (Python), 477 (Java) |
| | Cerberus [shariffdeen_cerberus_2023] | C, C++, Java | 2,242 (across 4 task types) |
| Vulnerability Detection | CVEFixes [guru_prasad_bhandari_cvefixes_2021] | Various Languages | 5,365 |
| | LLMSecEval [catherine_tony_llmseceval_2023] | C | 150 (on 25 distinct vulnerabilities) |
| | SecurityEval [mohammed_latif_siddiq_securityeval_2022] | 6 languages | 130 (on 75 common vulnerabilities) |
| | Vul4J [bui_vul4j_2022] | Java | 79 vulnerabilities |
| | FormAI [tihanyi_formai_2023] | C | 112,000 labeled instances |
| | VJBench [wu_how_2023] | Java | 42 vulnerabilities |
| | SmartBugs [durieux_empirical_2020] | Solidity | 69 vulnerable smart contracts |
| | Devign [zhou_devign_2019] | C | 4 large-scale software repositories |
| | D2A [zheng_d2a_2021] | C/C++ | 6 OSS programs |
| | BigVul [10.1145/3379597.3387501] | C/C++ | 348 projects |
| | SARD (https://samate.nist.gov/SARD/) | Java, C, C++, C#, PHP | 32k (as of 4 Feb 2025) |
| | Juliet 1.3 (https://samate.nist.gov/SARD/test-suites/112) | C/C++ | 64k (as of 4 Feb 2025) |
| | NVD (https://nvd.nist.gov/developers/data-sources) | Various Languages | 265k (as of 4 Feb 2025) |
| Software Testing | CoverageEval [michele_tufano_predicting_2023] | Python | 1,160 |
| | ATLAS [watson_learning_2020] | Java | 9,275 projects |
| | HITS [wang_hits_2024] | Java | 10 projects |
| | MeMo [bareis_code_2022] | Java | 9 projects |
| | MLAPIs [wan_automated_2022] | Python | 63 applications |
🔼 This table provides a comprehensive overview of existing benchmarks used for evaluating automated program repair, fault localization, and vulnerability detection techniques. It details each benchmark’s name, programming language(s) it supports, and the number of samples or datasets it includes. This information is crucial for researchers to select appropriate benchmarks based on their specific needs and research focus.
TABLE XIII: Overview of Automated Program Repair, Fault Localization, and Vulnerability Detection Benchmarks.
| Category | Benchmark | Language(s) | No. of Samples |
| --- | --- | --- | --- |
| Code Synthesis & Understanding | Methods2Test [Methods2Test-2020] | Java | 780,944 |
| | CRUXEval [CRUXEval-2024] | Python | 800 |
| | CRQBench [elizabeth_dinella_crqbench_2024] | C++ | 100 |
| | CriticBench [lin_criticbench_2024] | Python | 3,825 (across 5 tasks) |
| | CodeScope [yan_codescope_2024] | 8 programming languages | 13,390 (across 8 tasks) |
| Merge Conflict Repair | ConflictBench [ConflictBench-2024] | Java | 180 |
| Type Inference | TypeEvalPy [ashwin_prasad_shivarpatna_venkatesh_typeevalpy_2023] | Python | 845 (annotated labels) |
| | TypeEvalPy AutoGen [ashwin_prasad_shivarpatna_venkatesh_typeevalpy_2023] | Python | 78,373 (annotated labels) |
| Automatic Code Quality Review | CodeReview [zhiyu_li_automating_2022] | 8 languages | 7.9M pull requests |
| | Software Maintainability [markus_schnappinger_defining_2020] | Java | 519 projects (evaluations of quality) |
| Hallucination Detection | HALLUCODE [fang_liu_exploring_2024] | Python | 5,663 |
🔼 This table presents a collection of benchmarks focusing on various software engineering workflows. It details the specific tasks covered by each benchmark, the programming languages involved, and the number of samples or problems included. These benchmarks are used to evaluate AI models’ performance in these specific SE workflows.
TABLE XIV: Overview of Selected SE-Workflow Benchmarks.
| Name | Language(s) | Tasks | Information |
| --- | --- | --- | --- |
| Big-Bench [BIG-Bench-2022] | Python, Numeric, JSON, English | Functions over numbers, Mathematical Reasoning, Text2Code, Code2Text, Code explanation, Debugging, Turing Complete Concept Learning, amongst other tasks | 250, several per category, 42, 60, 66, 34, 6,390 |
| XLCoST [XLCoST-2022] | C, C++, C#, Java, JavaScript, Kotlin, PHP, Python, Ruby, Rust | Text2Code (program synthesis, code search), Code Summarization, Code Translation | 567K (509k, 58k), 567K, 122K |
| CrossCodeBench [changan_niu_crosscodebench_2023] | Java, C#, Python, C++, JS, PHP, Go, Ruby, TS, C, Bash, Shell | Classification, In-Filling, Translation, Generation, Summarization, Type Prediction, Question Answering | 6.6M, 13.4M, 2.4M, 19.5M, 11.2M, 773K, 190K |
| Long Code Arena [long-code-arena-2024] | English, Python, Java, Kotlin | Commit Message Generation, Module Summarization, Library-Based Code Generation, Project-Level Code Completion, Bug Localization, CI Builds Repair | 163, 216, 150, 908 (varying sizes), 14.96K, 78 |
| CodeXGLUE [CodeXGLUE-2021] MicrosoftDocs (https://github.com/MicrosoftDocs/) | English, Chinese, Norwegian, Danish, Latvian | Code Documentation Translation | CN: 52K, NOR: 26K, DK: 45K, LT: 21K |
| CodeSearchNet [CodeSearchNet-Challenge-2019] | Go, Java, JavaScript, PHP, Python, Ruby | Code Documentation (Code Summarization, Comment Generation) | 621,870 |
🔼 This table presents a list of benchmarks that evaluate multiple aspects of AI models in software engineering tasks. Unlike single-task benchmarks, these multi-category benchmarks assess a wide range of capabilities, providing a more holistic evaluation of the AI model’s overall performance.
TABLE XV: Overview of Multi-Category Benchmarks, Covering Various Tasks.