BaxBench: Can LLMs Generate Secure and Correct Backends?

We are calling for community contributions! Please visit our GitHub repository for details.
TL;DR: We introduce a novel benchmark to evaluate LLMs on secure and correct code generation, showing that even flagship LLMs are not ready for coding automation, frequently generating insecure or incorrect code.

Key Takeaways

  • Even for the best model, 62% of the generated solutions are either incorrect or contain a security vulnerability, highlighting that LLMs cannot yet generate deployment-ready code.
  • On average, around half of the correct solutions are insecure, raising concerns about current metrics and evaluations that focus only on code correctness.
  • Security requirements add coding complexity and lead to correctness tradeoffs, indicating that targeted efforts are required to increase the secure and correct coding rates of models.

BaxBench Leaderboard

In the leaderboard below, we show the performance of state-of-the-art LLMs tested on BaxBench. The leaderboard can be toggled between three prompt types with varying levels of security-specific instructions; each prompt type is described below the leaderboard. See our paper for more results.
[Leaderboard table: Rank | Model | Correct & Secure | Correct | % Insecure of Correct]
The three prompt types are as follows:

  • No security instructions: The models are only prompted to complete the coding task. The prompt contains no security-specific instructions, reflecting a realistic interaction with a developer who makes no explicit security considerations.
  • Generic security reminder: The models are prompted to complete the coding task and are explicitly reminded to make security considerations and follow security best practices.
  • Scenario-specific security reminder (oracle): The models are prompted to complete the coding task and are explicitly reminded to avoid the specific security vulnerabilities that could occur in the given task. This setting assumes an unrealistic oracle that anticipates all security pitfalls and thus provides an upper bound on the models' security performance.
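To make the settings concrete, below is a minimal, hypothetical sketch of how such prompt variants could be assembled. The function name, reminder texts, and security_level values are illustrative assumptions and do not reproduce the exact prompts used in our evaluation; those can be reconstructed from the Hugging Face dataset described further down.

def build_prompt(task_description: str, security_level: str, cwe_hints: list[str]) -> str:
    # Illustrative only: start from the task description and optionally append
    # a security reminder, mirroring the three leaderboard settings above.
    prompt = task_description
    if security_level == "generic":
        prompt += "\n\nFollow security best practices in your implementation."
    elif security_level == "oracle":
        prompt += "\n\nAvoid the following potential vulnerabilities:\n" + "\n".join(cwe_hints)
    return prompt  # security_level == "none": task description only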

What is BaxBench?

Overview
BaxBench is a novel coding benchmark for evaluating the ability of LLMs to generate correct and secure code in realistic, security-critical settings.

The benchmark consists of 392 security-critical backend coding tasks, formed by combining 28 coding scenarios with 14 popular backend development frameworks across 6 programming languages. Each scenario consists of an OpenAPI specification and a textual description of the API endpoints the backend application should implement. Additionally, each scenario comes with a set of functional tests and expert-designed security exploits, which are executed against the LLM-generated solutions to test their functional correctness and security.
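For illustration, the following sketch shows how a task can be thought of as the pairing of a scenario with a framework; the class names, fields, and example values are our own and do not reflect the benchmark's actual code or data format.

from dataclasses import dataclass

@dataclass
class Scenario:
    name: str          # e.g., a login or file-upload service
    openapi_spec: str  # OpenAPI description of the required endpoints
    description: str   # textual requirements for the backend

@dataclass
class Task:
    scenario: Scenario
    framework: str     # e.g., "Flask"
    language: str      # e.g., "Python"

# Illustrative placeholders; the full benchmark has 28 scenarios and 14 frameworks.
scenarios = [
    Scenario("login", "<OpenAPI spec for the login endpoints>", "Implement a login service ..."),
]
frameworks = [("Flask", "Python"), ("Express", "JavaScript")]

# Each task pairs a scenario with a framework: 28 x 14 = 392 tasks in the full benchmark.
tasks = [Task(s, fw, lang) for s in scenarios for fw, lang in frameworks]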

How are BaxBench tasks tested?

The LLM-generated solutions are tested for correctness using the end-to-end functionality tests provided for each scenario. Then, to assess the vulnerability of the generated programs, we execute concrete security exploits on the generated solutions. These exploits are designed by security experts for each scenario and are automatically executed on the models' solutions. We implement two types of security exploits: (i) black-box exploits that attack the application using malicious queries, e.g., SQL injections or path traversal, and (ii) white-box exploits, e.g., checking for passwords or unencrypted secrets in the artifacts produced by the application. Both the functional tests and the security exploits are framework- and implementation-agnostic, enabling the modular scalability of the benchmark.
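To illustrate the two exploit styles, the sketch below shows a black-box SQL injection probe against a hypothetical /login endpoint and a white-box scan for secrets stored in plaintext in the application's artifacts. The URL, payloads, endpoint, and file handling are assumptions made for illustration and are not the actual BaxBench exploits.

import os
import requests

APP_URL = "http://localhost:5000"  # assumed address of the deployed solution

def blackbox_sql_injection() -> bool:
    # Black-box exploit: try to bypass authentication with a SQL injection payload.
    payload = {"username": "admin' OR '1'='1' --", "password": "x"}
    resp = requests.post(f"{APP_URL}/login", json=payload, timeout=5)
    # Illustrative check: a successful login response indicates the solution is vulnerable.
    return resp.status_code == 200

def whitebox_plaintext_secret(artifact_dir: str, secret: str) -> bool:
    # White-box exploit: scan artifacts produced by the application (e.g., its database
    # files) for a secret that should never have been stored unencrypted.
    for root, _, files in os.walk(artifact_dir):
        for fname in files:
            with open(os.path.join(root, fname), "rb") as f:
                if secret.encode() in f.read():
                    return True  # the secret was found in plaintext
    return False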

How can I evaluate my model on BaxBench?

To generate solutions to BaxBench tasks, we make a dataset of the tasks available on Hugging Face. The dataset includes the scenario specifications, the package requirements for each implementation framework, and the list of potential CWEs associated with each scenario. With this dataset, the prompts used in our evaluation can be reconstructed exactly.
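A minimal sketch of this workflow, assuming the Hugging Face datasets library, is shown below; the dataset identifier, split, and field names are placeholders and should be replaced with the ones listed on our Hugging Face page.

from datasets import load_dataset

# Placeholder dataset identifier and split; see the BaxBench Hugging Face page for the actual ones.
ds = load_dataset("<baxbench-dataset-id>", split="test")

for task in ds:
    # Assumed field names: scenario description and OpenAPI spec, framework package
    # requirements, and the list of potential CWEs for the scenario.
    prompt = (
        f"{task['scenario_description']}\n\n"
        f"OpenAPI specification:\n{task['openapi_spec']}\n\n"
        f"Allowed packages:\n{task['package_requirements']}"
    )
    # Send `prompt` to the model under evaluation and collect its solution here.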

The generated solutions can then be tested using our codebase, which contains the functional tests and security exploits to be executed. Instructions for running the tests are provided in the repository.

How can I contribute to BaxBench?

We welcome scenario, framework, test, or exploit contributions as well as general feedback from the community. Please visit our GitHub repository for details.

Citation

@article{vero2025baxbenchllmsgeneratecorrect,
        title={BaxBench: Can LLMs Generate Correct and Secure Backends?}, 
        author={Mark Vero and Niels Mündler and Victor Chibotaru and Veselin Raychev and Maximilian Baader and Nikola Jovanović and Jingxuan He and Martin Vechev},
        year={2025},
        eprint={2502.11844},
        archivePrefix={arXiv},
}