[Leaderboard: Rank | Model | Correct & Secure | Correct | % Insecure of Correct]
The benchmark consists of 392 security-critical backend coding tasks, formed by combining 28 coding scenarios with 14 popular backend development frameworks across 6 programming languages. Each scenario consists of an OpenAPI specification and a textual description of the API endpoints the backend application should implement. Additionally, each scenario comes with a set of functional tests and expert-designed security exploits, which are used to test the correctness and security of the LLM-generated solutions.
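To make the task structure concrete, the following Python sketch models a task as the combination of a scenario and a target framework; the class and field names are illustrative and do not reflect the benchmark's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One of the 28 coding scenarios (field names are illustrative)."""
    name: str                       # short scenario identifier
    openapi_spec: str               # OpenAPI document describing the endpoints
    description: str                # textual description of the required API
    functional_tests: list = field(default_factory=list)   # end-to-end tests
    security_exploits: list = field(default_factory=list)  # expert-designed exploits

@dataclass
class Task:
    """A benchmark task: one scenario paired with one framework (28 x 14 = 392)."""
    scenario: Scenario
    framework: str                  # one of the 14 backend frameworks
    language: str                   # one of the 6 programming languages
```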
The LLM-generated solutions are first tested for correctness using the end-to-end functional tests provided for each scenario. Then, to assess the vulnerability of the generated programs, we execute concrete security exploits on them. These exploits are designed by security experts for each scenario and are run automatically against the models' solutions. We implement two types of security exploits: (i) black-box exploits that attack the running application with malicious queries, e.g., SQL injections or path traversals, and (ii) white-box exploits that inspect the artifacts produced by the application, e.g., checking for plaintext passwords or unencrypted secrets. Both the functional tests and the security exploits are framework- and implementation-agnostic, enabling the modular scalability of the benchmark.
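To illustrate the two exploit types, here is a minimal Python sketch of a black-box and a white-box check. It assumes a hypothetical application running at http://localhost:5000 with a /login endpoint and a SQLite artifact app.db; none of these specifics are prescribed by the benchmark.

```python
import requests
import sqlite3

APP_URL = "http://localhost:5000"   # assumed address of the application under test

def blackbox_sqli_exploit() -> bool:
    """Black-box exploit sketch: try a classic SQL injection against a login endpoint.

    Returns True if the injection appears to bypass authentication, i.e. the
    application is vulnerable. The endpoint and payload are illustrative.
    """
    payload = {"username": "admin' OR '1'='1' --", "password": "anything"}
    resp = requests.post(f"{APP_URL}/login", json=payload, timeout=5)
    return resp.status_code == 200  # a successful login here suggests the bypass worked

def whitebox_plaintext_password_exploit(db_path: str = "app.db") -> bool:
    """White-box exploit sketch: inspect an artifact produced by the application
    (here, an assumed SQLite database) for passwords stored in plaintext."""
    registered_password = "S3cretPass!"  # password previously registered via the API
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT * FROM users").fetchall()  # table name is an assumption
    finally:
        conn.close()
    # If the raw password string appears in any stored column, it was not hashed.
    return any(registered_password in str(col) for row in rows for col in row)
```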
To facilitate generating solutions to BaxBench tasks, we make a dataset of the tasks available on Hugging Face. The dataset includes the scenario specifications, the package requirements for each implementation framework, and the list of potential CWEs associated with each scenario. With this dataset, the prompts used in our evaluation can be exactly reconstructed.
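A hedged sketch of how the dataset could be loaded and turned into a prompt with the Hugging Face `datasets` library; the dataset identifier, split name, and field names below are placeholders, so consult the dataset card for the actual ones.

```python
from datasets import load_dataset

# Placeholder identifier: substitute the dataset id listed on the BaxBench
# Hugging Face page. The split name is also an assumption.
DATASET_ID = "<org>/baxbench"
ds = load_dataset(DATASET_ID, split="train")

# Each record is assumed to carry the scenario specification, the framework's
# package requirements, and the list of potentially relevant CWEs.
example = ds[0]
prompt = (
    f"{example['scenario_description']}\n\n"          # field names are assumptions
    f"OpenAPI specification:\n{example['openapi_spec']}\n\n"
    f"Allowed packages:\n{example['package_requirements']}"
)
print(prompt)
```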
The generated solutions can then be tested using our codebase, which contains the functional tests and security exploits together with the infrastructure to execute them. Instructions for running the tests are provided in the repository.
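The following Python sketch is not the repository's actual harness, but it illustrates how the leaderboard metrics follow from the two test stages: a solution is "correct" if it passes all functional tests, and "correct & secure" if, in addition, no security exploit succeeds against it.

```python
def evaluate(solutions, functional_tests, exploits):
    """Aggregate leaderboard-style metrics over a list of solutions.

    Each functional test and exploit is assumed to be a callable that takes a
    solution and returns True on success (test passed / exploit succeeded).
    """
    correct = 0
    correct_and_secure = 0
    for solution in solutions:
        # A solution must pass every functional test to count as correct.
        if not all(test(solution) for test in functional_tests):
            continue
        correct += 1
        # It is secure only if no exploit succeeds against it.
        if not any(exploit(solution) for exploit in exploits):
            correct_and_secure += 1
    insecure_of_correct = (correct - correct_and_secure) / correct if correct else 0.0
    return {
        "correct": correct,
        "correct_and_secure": correct_and_secure,
        "% insecure of correct": 100 * insecure_of_correct,
    }
```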
We welcome scenario, framework, test, or exploit contributions as well as general feedback from the community. Please visit our GitHub repository for details.
@article{vero2025baxbenchllmsgeneratecorrect,
  title={BaxBench: Can LLMs Generate Correct and Secure Backends?},
  author={Mark Vero and Niels Mündler and Victor Chibotaru and Veselin Raychev and Maximilian Baader and Nikola Jovanović and Jingxuan He and Martin Vechev},
  year={2025},
  eprint={2502.11844},
  archivePrefix={arXiv},
}