TL;DR: We present AutoBaxBuilder, an automated framework that generates code security benchmark tasks from scratch, reducing manual effort by ~12× while matching or outperforming expert-written tests and exploits.

Key Takeaways

  • AutoBaxBuilder generates complete benchmark tasks (scenarios, tests, exploits) in under 2 hours for less than USD 10, representing a ~12× reduction in effort compared to manual creation.
  • Generated tests and exploits successfully match or outperform expert-written ones on BaxBench scenarios, tightening the upper security bound.
  • The framework addresses benchmark contamination and enables continuous expansion with varying difficulty levels to challenge increasingly capable LLMs.

AutoBaxBench Leaderboard

The leaderboard below shows the performance of state-of-the-art LLMs on AutoBaxBench scenarios. It can be toggled between the Easy, Medium, and Hard subsets, which require implementing 1, 3, and 5 API endpoints, respectively. See our paper for more results.
[Interactive leaderboard: one table per subset (Easy, Medium, Hard) with columns Rank, Model, Correct & Secure, Correct %, Insecure of Correct.]
The models are prompted only to complete the coding task; the prompt contains no security-specific instructions, reflecting a realistic interaction with a developer who does not explicitly raise security considerations (a minimal prompt sketch is shown below).

Models marked with * were used for task generation, indicating potential contamination.
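For illustration, here is a minimal sketch of how such a security-agnostic evaluation prompt could be assembled. The function name, parameters, and wording are our own placeholders and do not correspond to the exact AutoBaxBench prompt template.

```python
def build_eval_prompt(scenario_description: str, framework: str) -> str:
    """Assemble a purely functional coding prompt.

    Illustrative only: the exact AutoBaxBench prompt wording may differ.
    Note that nothing in the prompt asks the model to write secure code.
    """
    return (
        f"Implement the following web application using {framework}.\n\n"
        f"{scenario_description}\n\n"
        "Return the complete, runnable source code for the application."
    )
```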

How does AutoBaxBuilder work?

AutoBaxBuilder Pipeline
AutoBaxBuilder is an LLM-based pipeline that starts from scratch and produces a complete benchmark instance with scenario description, test cases and end-to-end exploits.

The AutoBaxBuilder pipeline employs an agentic, LLM-based approach that starts from scratch and produces complete benchmark instances. It first generates novel scenario descriptions, functional tests, and solutions, iterating until execution feedback confirms that the tests are solvable. Next, the LLM designs end-to-end exploits to expose vulnerabilities, iterating until it finds a pair of solutions: a vulnerable one on which the exploit succeeds and a secure one on which it fails. These validated artifacts are then combined into a new task instance.
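A minimal sketch of this two-stage loop is shown below. All names (`build_task`, `run_tests`, `run_exploit`, and the methods on `llm`) are hypothetical placeholders for illustration and do not correspond to the actual AutoBaxBuilder code.

```python
def build_task(llm, run_tests, run_exploit, max_iters=5):
    """Hypothetical sketch of the AutoBaxBuilder loop (not the real implementation).

    `llm` is assumed to expose generation/refinement helpers; `run_tests` and
    `run_exploit` execute the artifacts in a sandbox and report the outcome.
    """
    # Stage 1: scenario + functional tests, iterated with execution feedback
    # until a reference solution passes the generated tests.
    scenario = llm.generate_scenario()
    tests, solution = llm.generate_tests_and_solution(scenario)
    for _ in range(max_iters):
        feedback = run_tests(tests, solution)
        if feedback.all_passed:
            break
        tests, solution = llm.refine_tests_and_solution(scenario, tests, solution, feedback)

    # Stage 2: end-to-end exploit, iterated until it separates a vulnerable
    # solution (exploit succeeds) from a secure one (exploit fails).
    exploit = llm.generate_exploit(scenario)
    for _ in range(max_iters):
        vulnerable, secure = llm.generate_solution_pair(scenario, exploit)
        if run_exploit(exploit, vulnerable) and not run_exploit(exploit, secure):
            break
        exploit = llm.refine_exploit(scenario, exploit, vulnerable, secure)

    # Combine the validated artifacts into a new benchmark task instance.
    return {"scenario": scenario, "tests": tests, "exploit": exploit}
```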

AutoBaxBuilder vs Expert-Written Tests

AutoBaxBuilder vs Expert-Written Tests Comparison
The trends and overall performance of LLMs on AutoBaxBuilder-generated tests and exploits are similar to those on the expert-written tests from the original BaxBench.

To validate the quality of AutoBaxBuilder, we compared its generated tests and exploits against those written by security experts for the same BaxBench scenarios. Our evaluation shows that AutoBaxBuilder matches or outperforms the expert-written functional tests and exploits, tightening the upper security bound reported by BaxBench.

Citation

@article{vonarx2025autobaxbuilderbootstrappingcodesecurity,
      title={AutoBaxBuilder: Bootstrapping Code Security Benchmarking}, 
      author={Tobias von Arx and Niels Mündler and Mark Vero and Maximilian Baader and Martin Vechev},
      year={2025},
      eprint={2512.21132},
      archivePrefix={arXiv},
}