We tested 16 frontier language models on the curated VulnBench-200 evaluation subset. The best model fixes 55.5% of those instances; the median model manages only about 24%.
Even the best models leave nearly half of real vulnerabilities unpatched. Performance varies dramatically across model families, vulnerability types, and cost tiers.
GPT-5.3 Codex leads the benchmark, successfully patching 111 of 200 real vulnerabilities, 10 percentage points ahead of second place.
The median model fixes roughly 1 in 4 vulnerabilities. There is a sharp drop-off after the top five models, which all exceed a 40% pass rate.
DeepSeek V3.2 achieves a 23.5% pass rate at roughly $0.001 per instance, about 37x cheaper than the top model at under half its accuracy.
Even on the easiest vulnerabilities (XSS, SQL injection), no model exceeds 62%. Pattern-matching fixes are not yet solved.
Most models score 0% on the benchmark's Rust and Swift CVEs (one instance each). Security patch generation remains heavily skewed toward the JavaScript and Python ecosystems.
All 16 models evaluated on all 200 instances for under $50 total via OpenRouter — making this benchmark highly reproducible.
All models evaluated on the same 200 CVE instances using Claude Opus 4.6 as judge. Temperature 0.0, max 4096 tokens.
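The setup above can be sketched as an OpenAI-compatible request to OpenRouter's chat completions endpoint. This is an illustrative sketch, not the benchmark's actual harness; the model ID and prompts are placeholders, and only the temperature and token limit come from the stated config:

```python
import json

# OpenRouter exposes an OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model_id: str, system_prompt: str, vuln_prompt: str) -> dict:
    """Build a request body using the evaluation settings stated above."""
    return {
        "model": model_id,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": vuln_prompt},
        ],
        "temperature": 0.0,   # deterministic decoding, per the eval config
        "max_tokens": 4096,   # per the eval config
    }

# Hypothetical model ID and prompt text for illustration only.
body = build_request(
    "openai/gpt-5.3-codex",
    "You are a security engineer. Return a unified diff patch.",
    "Patch the following vulnerable code ...",
)
payload = json.dumps(body)  # POST this to OPENROUTER_URL with an API key header
```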
| # | Model | Pass Rate | Score | Passed | Avg Time | Cost |
|---|---|---|---|---|---|---|
| 1 | GPT-5.3 Codex (OpenAI) | 55.5% | 0.514 | 111/200 | 40.5s | $6.24 |
| 2 | Claude Opus 4.6 (Anthropic) | 45.5% | 0.457 | 91/200 | 18.8s | $6.46 |
| 3 | GPT-5.4 (OpenAI) | 42.5% | 0.431 | 85/200 | 7.9s | $1.94 |
| 4 | Claude Sonnet 4.6 (Anthropic) | 42.0% | 0.419 | 84/200 | 13.7s | $3.65 |
| 5 | GPT-5.2 (OpenAI) | 42.0% | 0.374 | 84/200 | 67.6s | $8.50 |
| 6 | Gemini 3 Flash (Google) | 33.5% | 0.373 | 67/200 | 4.2s | $0.34 |
| 7 | Kimi K2.5 (Moonshot AI) | 29.0% | 0.301 | 58/200 | 261s | $1.98 |
| 8 | Claude Haiku 4.5 (Anthropic) | 25.0% | 0.317 | 50/200 | 7.0s | $1.06 |
| 9 | DeepSeek V3.2 (DeepSeek) | 23.5% | 0.314 | 47/200 | 36.1s | $0.17 |
| 10 | GLM-5 (Zhipu AI) | 19.5% | 0.268 | 39/200 | 80.1s | $2.32 |
| 11 | GPT-5 Mini (OpenAI) | 17.5% | 0.307 | 35/200 | 27.6s | $0.71 |
| 12 | Qwen 3.5-27B (Alibaba) | 15.0% | 0.259 | 30/200 | 163s | $4.53 |
| 13 | MiniMax M2.5 (MiniMax) | 14.5% | 0.248 | 29/200 | 49.0s | $0.47 |
| 14 | Qwen 3.5-35B-A3B (Alibaba) | 11.0% | 0.235 | 22/200 | 28.8s | $1.01 |
| 15 | Gemini 3.1 Pro (Google) | 6.5% | 0.093 | 13/200 | 60.0s | $9.96 |
| 16 | Step 3.5 Flash (StepFun) | 0.0% | 0.000 | 0/200 | 24.4s | $0.00 |
Vulnerabilities are categorized into three difficulty tiers based on the complexity of the required fix. The benchmark is balanced: 67 / 67 / 66 instances across tiers.
| Model | Tier 1 Pattern | Tier 2 Logic | Tier 3 Deep |
|---|---|---|---|
| GPT-5.3 Codex | 61.2% | 53.7% | 51.5% |
| Claude Opus 4.6 | 50.7% | 43.3% | 42.4% |
| GPT-5.4 | 46.3% | 32.8% | 48.5% |
| Claude Sonnet 4.6 | 41.8% | 40.3% | 43.9% |
| GPT-5.2 | 47.8% | 41.8% | 36.4% |
| Gemini 3 Flash | 35.8% | 26.9% | 37.9% |
| Kimi K2.5 | 35.8% | 25.4% | 25.8% |
| Claude Haiku 4.5 | 23.9% | 26.9% | 24.2% |
| DeepSeek V3.2 | 25.4% | 22.4% | 22.7% |
**Tier 1 (Pattern):** XSS (CWE-79), SQL injection (CWE-89), path traversal (CWE-22). Fixes follow well-known patterns: escape output, parameterize queries, sanitize paths.

**Tier 2 (Logic):** Authorization (CWE-862/863), CSRF (CWE-352), information disclosure (CWE-200). Requires understanding application logic to add missing checks.

**Tier 3 (Deep):** Code injection (CWE-94), resource exhaustion (CWE-400), input validation (CWE-20). Requires deep reasoning about execution flow and system behavior.
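As a concrete Tier 1 example, the CWE-89 pattern fix (parameterize queries) can be sketched as follows. The table schema and queries are invented for illustration; only the fix pattern itself reflects the tier definition above:

```python
import sqlite3

# Toy database standing in for a real application's data layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_vulnerable(name: str):
    # CWE-89: attacker-controlled input interpolated into the SQL text.
    return conn.execute(f"SELECT role FROM users WHERE name = '{name}'").fetchall()

def find_user_fixed(name: str):
    # Tier 1 pattern fix: parameterized query, input never touches the SQL text.
    return conn.execute("SELECT role FROM users WHERE name = ?", (name,)).fetchall()

attack = "' OR '1'='1"
print(find_user_vulnerable(attack))  # injection matches every row
print(find_user_fixed(attack))       # no user literally named "' OR '1'='1"
```

The fix is purely syntactic, which is why this tier is the best case for pattern-matching models.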
Each model receives the vulnerability description, affected source code, and CWE-specific guidance. An LLM judge (Claude Opus 4.6) scores the candidate patch against the ground-truth fix commit.
**Source data:** 10,000+ CVEs from the GitHub Advisory Database, enriched with CVSS scores and CWE IDs from NVD.
**Curation:** Select CVEs with fix commits, download repository snapshots, extract ground-truth diffs, scrub prompt leakage, then quality-score and tier-classify.
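The leakage-scrubbing step can be sketched like this. The patterns and helper are hypothetical, not VulnBench's actual scrubber; the idea is that advisory text often names the fix commit or links the patch, which would let a model copy the answer:

```python
import re

# Hypothetical leak patterns: fix/patch URLs and bare commit SHAs.
LEAK_PATTERNS = [
    re.compile(r"https?://\S*(commit|pull|patch)\S*", re.IGNORECASE),  # fix links
    re.compile(r"\b[0-9a-f]{7,40}\b"),                                 # commit SHAs
]

def scrub(description: str) -> str:
    """Redact obvious pointers to the ground-truth fix from advisory text."""
    for pat in LEAK_PATTERNS:
        description = pat.sub("[REDACTED]", description)
    return description

text = ("Fixed in commit deadbeefcafe, see "
        "https://github.com/example/repo/commit/deadbeefcafe for details.")
clean = scrub(text)
print(clean)
```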
**Patch generation:** Each model receives the CVE description, CWE guidance, affected files, and vulnerable source, and must generate a unified diff patch.
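The expected output shape is a standard unified diff. The snippet below generates one with `difflib` purely to illustrate the format; the file contents and names are invented:

```python
import difflib

# A toy vulnerable file (reflected XSS) and its fixed version.
vulnerable = [
    "def render(comment):\n",
    "    return f\"<p>{comment}</p>\"\n",
]
fixed = [
    "import html\n",
    "\n",
    "def render(comment):\n",
    "    return f\"<p>{html.escape(comment)}</p>\"\n",
]

# A model's answer should look like this: a unified diff against the file.
patch = "".join(difflib.unified_diff(
    vulnerable, fixed, fromfile="a/views.py", tofile="b/views.py"))
print(patch)
```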
**Judging:** Claude Opus 4.6 compares the candidate against the reference: root cause addressed, no new vulnerabilities, correct scope. Score 0.0–1.0.
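The scoring step might be parsed along these lines. The judge response schema and the parser are assumptions for illustration; only the criteria and the 0.0–1.0 range come from the description above:

```python
import re

# Criteria from the judging description; the JSON reply format is hypothetical.
JUDGE_CRITERIA = ["root cause addressed", "no new vulnerabilities", "correct scope"]

def parse_score(judge_response: str) -> float:
    """Extract a 0.0-1.0 score from the judge's reply, clamping out-of-range values."""
    match = re.search(r'"score"\s*:\s*([0-9.]+)', judge_response)
    if match is None:
        return 0.0  # an unparseable reply counts as a failed patch
    return min(max(float(match.group(1)), 0.0), 1.0)

reply = '{"verdict": "pass", "score": 0.85, "notes": "root cause addressed"}'
score = parse_score(reply)
```

Treating unparseable judge output as 0.0 keeps the metric conservative: a model never gets credit it cannot demonstrate.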
200 curated instances from 200 unique GitHub repositories, balanced across difficulty tiers.
- **Severity:** Critical: 21, High: 42, Medium: 137
- **Top CWEs:** CWE-79 (XSS): 38, CWE-22 (Path Traversal): 25, CWE-400 (DoS): 25, CWE-20 (Input Validation): 23, CWE-94 (Code Injection): 19
- **Ecosystem:** npm: 134, pip: 54, Maven: 5, RubyGems: 3, Composer: 2, Rust: 1, Swift: 1
- **CVE year:** 2026: 16, 2025: 55, 2024: 48, 2023: 26, 2022: 24, earlier: 31
VulnBench is fully open source and reproducible. Evaluate any LLM in under an hour for less than $10.