We tested 16 frontier language models on 1,650 real CVEs and a curated 200-instance subset. On VulnBench-200 (best-of-3), the best model fixes 22.5%. On the full 1,650-instance benchmark, the best model fixes 16.6%. The median model fixes under 4%.
On the curated 200-instance subset (best-of-3), even the best model patches fewer than 1 in 4 vulnerabilities. On the full 1,650-CVE benchmark, performance drops further, and the rankings shift significantly.
Claude Opus 4.6 leads the full benchmark at 16.6% (273/1650), narrowly ahead of GPT-5.3 Codex at 16.4% (271/1650), a ranking reversal from the curated subset.
GPT-5.3 Codex leads VulnBench-200 (best-of-3) at 22.5%, followed by GPT-5.4 at 18.5% and Claude Opus 4.6 at 16.0%.
The median model fixes only 3.4% of the full 1,650-instance benchmark, roughly 1 in 30 real vulnerabilities.
Claude Opus 4.6 achieves the highest mean judge score on the full benchmark (0.417), indicating consistently higher patch quality at scale.
GPT-5.4 drops from 2nd on the curated subset (18.5%) to 10th on the full benchmark (3.4%). Claude Sonnet 4.6 and Gemini 3 Flash scale better, holding or improving their relative positions.
Gemini 3 Flash achieves 9.7% on the full benchmark (160/1650) at one of the lowest costs of any model evaluated, strong performance for a lightweight model.
Curated 200-instance subset with best-of-3 variance reduction and description-only file hints. All 16 models also evaluated on the full 1,650-instance benchmark (single pass). Judge: Claude Opus 4.6.
| # | Model | Judge Score | Passed | Avg Time | Cost |
|---|---|---|---|---|---|
| 1 | GPT-5.3 Codex (OpenAI) | 0.468 | 45/200 | 43.4s | $8.74 |
| 2 | GPT-5.4 (OpenAI) | 0.407 | 37/200 | 7.3s | $4.81 |
| 3 | Claude Opus 4.6 (Anthropic) | 0.404 | 32/200 | 19.6s | $10.17 |
| 4 | GPT-5.2 (OpenAI) | 0.322 | 30/200 | 75.2s | $11.30 |
| 5 | Claude Sonnet 4.6 (Anthropic) | 0.322 | 21/200 | 16.0s | $6.87 |
| 6 | Gemini 3 Flash (Google) | 0.318 | 15/200 | 5.1s | $3.13 |
| 7 | GLM-5 (Zhipu AI) | 0.249 | 14/200 | 91.9s | $4.26 |
| 8 | Kimi K2.5 (Moonshot AI) | 0.228 | 13/200 | 65.7s | $3.47 |
| 9 | Grok 4.1 Fast (xAI) | 0.273 | 11/200 | 61.1s | $3.46 |
| 10 | GPT-5 Mini (OpenAI) | 0.275 | 10/200 | 25.4s | $3.63 |
| 11 | DeepSeek V3.2 (DeepSeek) | 0.253 | 9/200 | 78.7s | $3.25 |
| 12 | Claude Haiku 4.5 (Anthropic) | 0.263 | 7/200 | 7.0s | $3.95 |
| 13 | Gemini 3.1 Pro (Google) | 0.093 | 5/200 | 44.4s | $9.60 |
| 14 | MiniMax M2.5 (MiniMax) | 0.181 | 3/200 | 45.0s | $3.25 |
| 14 | MiniMax M2.7 (MiniMax) | 0.099 | 3/200 | 22.8s | $1.74 |
| 16 | Step 3.5 Flash (StepFun) | 0.000 | 0/200 | 44.0s | $0.00 |
All 16 models evaluated on the full 1,650-instance benchmark (single pass). Description-only file hints. Judge: Claude Opus 4.6.
| # | Model | Judge Score | Passed | Avg Time |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 (Anthropic) | 0.417 | 273/1650 | 19.3s |
| 2 | GPT-5.3 Codex (OpenAI) | 0.301 | 271/1650 | 23.4s |
| 3 | Claude Sonnet 4.6 (Anthropic) | 0.335 | 201/1650 | 16.5s |
| 4 | Gemini 3 Flash (Google) | 0.311 | 160/1650 | 8.8s |
| 5 | GPT-5.2 (OpenAI) | 0.129 | 111/1650 | 26.1s |
| 6 | GLM-5 (Zhipu AI) | 0.219 | 98/1650 | 90.6s |
| 7 | GPT-5 Mini (OpenAI) | 0.246 | 85/1650 | 18.6s |
| 8 | Claude Haiku 4.5 (Anthropic) | 0.259 | 80/1650 | 6.5s |
| 9 | Gemini 3.1 Pro (Google) | 0.097 | 58/1650 | 56.1s |
| 10 | GPT-5.4 (OpenAI) | 0.092 | 56/1650 | 4.5s |
| 10 | Kimi K2.5 (Moonshot AI) | 0.102 | 56/1650 | 25.7s |
| 12 | MiniMax M2.7 (MiniMax) | 0.114 | 36/1650 | 24.2s |
| 13 | MiniMax M2.5 (MiniMax) | 0.092 | 35/1650 | 23.7s |
| 14 | Grok 4.1 Fast (xAI) | 0.108 | 31/1650 | 25.6s |
| 15 | DeepSeek V3.2 (DeepSeek) | 0.094 | 28/1650 | 25.2s |
| 16 | Step 3.5 Flash (StepFun) | 0.001 | 2/1650 | 42.3s |
VulnBench-200 instances are classified into three difficulty tiers (67 / 67 / 66 instances):

- Tier 1: XSS (CWE-79), SQL injection (CWE-89), path traversal (CWE-22). Fixes follow well-known patterns: escape output, parameterize queries, sanitize paths.
- Tier 2: authorization (CWE-862/863), CSRF (CWE-352), information disclosure (CWE-200). Requires understanding application logic to add missing checks.
- Tier 3: code injection (CWE-94), resource exhaustion (CWE-400), input validation (CWE-20). Requires deep reasoning about execution flow and system behavior.
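The tier structure can be pictured as a simple lookup from CWE ID to tier. A minimal Python sketch: only the CWE groupings come from the descriptions above, while the flat-lookup rule itself is an assumption for illustration, not the benchmark's actual classifier.

```python
# Map a CWE ID to a VulnBench-200 difficulty tier (illustrative only).
TIER_BY_CWE = {
    # Tier 1: fixes follow well-known patterns
    "CWE-79": 1, "CWE-89": 1, "CWE-22": 1,
    # Tier 2: requires understanding application logic
    "CWE-862": 2, "CWE-863": 2, "CWE-352": 2, "CWE-200": 2,
    # Tier 3: requires deep reasoning about execution flow
    "CWE-94": 3, "CWE-400": 3, "CWE-20": 3,
}

def tier_for(cwe_id: str) -> int | None:
    """Return the difficulty tier for a CWE ID, or None if it is not
    one of the families listed above."""
    return TIER_BY_CWE.get(cwe_id)
```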
Each model receives the vulnerability description, affected source code, and CWE-specific guidance. An LLM judge (Claude Opus 4.6) scores the candidate patch against the ground-truth fix commit.
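Concretely, each instance can be pictured as a small record bundling those model-facing inputs with the reference fix used only for judging. A minimal sketch, with field names that are assumptions rather than the published schema:

```python
from dataclasses import dataclass

@dataclass
class VulnInstance:
    """One benchmark instance (illustrative field names, not the published schema)."""
    cve_id: str                        # e.g. "CVE-2023-12345" (hypothetical)
    cwe_id: str                        # e.g. "CWE-79"
    repo: str                          # source repository the snapshot came from
    description: str                   # advisory text shown to the model
    cwe_guidance: str                  # CWE-specific fix guidance shown to the model
    affected_files: list[str]          # file paths hinted to the model
    vulnerable_source: dict[str, str]  # path -> file contents at the vulnerable snapshot
    ground_truth_diff: str             # reference fix commit diff, seen only by the judge
    tier: int = 0                      # difficulty tier assigned during curation
```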
The dataset starts from 10,000+ CVEs in the GitHub Advisory Database, enriched with CVSS scores and CWE IDs from NVD.
The curation pipeline selects CVEs with fix commits, downloads repository snapshots, extracts ground-truth diffs, scrubs prompt leakage, then quality-scores and tier-classifies each instance.
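The diff-extraction and leakage-scrub steps might look roughly like the following sketch; the function names and the scrub heuristic are assumptions, not the benchmark's actual implementation.

```python
import re
import subprocess

def extract_ground_truth_diff(repo_path: str, fix_commit: str) -> str:
    """Return the diff introduced by a fix commit (sketch)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "show", "--format=", fix_commit],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def scrub_prompt_leakage(description: str, cve_id: str, fix_commit: str) -> str:
    """Remove obvious giveaways (CVE ID, fix commit hash, commit URLs) from the
    advisory text before it is shown to the model. Naive heuristic, for illustration."""
    scrubbed = description.replace(cve_id, "[REDACTED]")
    scrubbed = scrubbed.replace(fix_commit, "[REDACTED]")
    scrubbed = re.sub(r"https?://\S*commit\S*", "[REDACTED]", scrubbed)
    return scrubbed
```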
Each model receives the CVE description, CWE guidance, affected file paths, and the vulnerable source, and must respond with a unified diff patch.
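Assembled into a prompt, those inputs might look like the sketch below, reusing the illustrative `VulnInstance` fields from above; the benchmark's exact prompt wording is not reproduced here.

```python
def build_patch_prompt(inst: "VulnInstance") -> str:
    """Build the patching prompt from one instance (sketch, not the real template)."""
    sources = "\n\n".join(
        f"--- {path} ---\n{code}"
        for path, code in inst.vulnerable_source.items()
    )
    return (
        f"Vulnerability description:\n{inst.description}\n\n"
        f"CWE guidance ({inst.cwe_id}):\n{inst.cwe_guidance}\n\n"
        f"Affected files: {', '.join(inst.affected_files)}\n\n"
        f"Vulnerable source:\n{sources}\n\n"
        "Respond with a unified diff patch that fixes the vulnerability."
    )
```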
Claude Opus 4.6 compares each candidate patch against the reference fix: is the root cause addressed, are any new vulnerabilities introduced, and is the change correctly scoped. Scores range from 0.0 to 1.0.
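The judging step can be sketched as a rubric prompt plus a numeric parse. Both the rubric wording and the `complete` callable (any client that sends a prompt to Claude Opus 4.6 and returns its text reply) are placeholders, not the benchmark's actual judge.

```python
JUDGE_RUBRIC = (
    "Compare the candidate patch against the reference fix. Judge whether the "
    "root cause is addressed, whether any new vulnerabilities are introduced, "
    "and whether the change is correctly scoped. Reply with a single score "
    "between 0.0 and 1.0."
)

def judge_patch(candidate_diff: str, reference_diff: str, complete) -> float:
    """Score a candidate patch with the LLM judge (sketch). Assumes the judge
    replies with a bare number, which the real harness may not."""
    prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"Reference fix:\n{reference_diff}\n\n"
        f"Candidate patch:\n{candidate_diff}"
    )
    reply = complete(prompt)
    score = float(reply.strip().split()[0])
    return max(0.0, min(1.0, score))
```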
The full benchmark contains 1,650 real CVE instances; the curated 200-instance evaluation subset draws from 200 unique repositories and is balanced across difficulty tiers.
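The only constraints stated here for the subset are one instance per repository and balance across tiers; a minimal sampling sketch under just those constraints follows (the real curation also applies quality scoring, which is omitted).

```python
import random
from collections import defaultdict

def sample_subset(instances: list["VulnInstance"], per_tier=(67, 67, 66), seed: int = 0):
    """Draw a tier-balanced subset with at most one instance per repository (sketch)."""
    rng = random.Random(seed)
    by_tier: dict[int, list] = defaultdict(list)
    for inst in instances:
        by_tier[inst.tier].append(inst)

    subset, used_repos = [], set()
    for tier, want in zip((1, 2, 3), per_tier):
        pool = by_tier[tier][:]
        rng.shuffle(pool)
        taken = 0
        for inst in pool:
            if taken == want:
                break
            if inst.repo in used_repos:
                continue
            subset.append(inst)
            used_repos.add(inst.repo)
            taken += 1
    return subset
```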
VulnBench is fully open source and reproducible. Evaluate any LLM on the curated 200-instance subset or the full 1,650-instance benchmark with a single command.
```
$ vulnbench run \
    --model claude-opus-4.6 \
    --subset vulnbench-full \
    --judge claude-opus-4.6

[████████████████████░░░░] 1354/1650

Model      Claude Opus 4.6
Pass Rate  16.6%
Score      0.417
Passed     273/1650
Time       19.3s
```