VulnBench evaluates AI coding models on real CVEs. Models receive the vulnerability description and source context, must find and patch the flaw without seeing the reference fix, and are scored on whether the patch fixes the root cause safely.
The top model solved 75 of 200 real vulnerability repair tasks (37.5%), while the median model solved 20 of 200 (10.0%). The results show clear progress from the strongest systems, but blind CVE repair remains difficult across the field.
Claude Fable 5 leads the leaderboard with 75/200 accepted patches and a 0.443 mean judge score.
GPT-5.5 ranks second at 33.0% (66/200), 4.5 percentage points behind the leader.
The median model fixes only 10.0% of real CVEs (20/200), even with source context and three attempts per task.
GPT-5.3 Codex has the highest mean judge score (0.468) despite ranking fifth on accepted patches (45/200), reflecting stronger partial-credit patch quality.
The leaderboard compares frontier, coding-specialized, and open-weight model families on the same vulnerability repair workload.
Even the leading model leaves 125 of 200 tasks (62.5%) without an accepted fix, underscoring how hard blind security patching remains.
Pass rate is the share of the 200 tasks on which a model, working without the reference fix, produced a patch judged to fix the vulnerable behavior.
| # | Model | Provider | Pass Rate | Score | Passed | Avg Time | Cost |
|---|---|---|---|---|---|---|---|
| 1 | Claude Fable 5 | Anthropic | 37.5% | 0.443 | 75/200 | 40.5s | $34.80 |
| 2 | GPT-5.5 | OpenAI | 33.0% | 0.338 | 66/200 | 177.6s | $53.83 |
| 3 | Claude Opus 4.8 | Anthropic | 29.5% | 0.418 | 59/200 | 20.3s | $12.34 |
| 4 | Qwen 3.7 Max | Qwen | 28.5% | 0.403 | 57/200 | 278.9s | $13.78 |
| 5 | GPT-5.3 Codex | OpenAI | 22.5% | 0.468 | 45/200 | 43.4s | $8.74 |
| 6 | Kimi K2.5 | Moonshot AI | 19.5% | 0.301 | 39/200 | 123.8s | $12.78 |
| 7 | Grok Build 0.1 | xAI | 19.0% | 0.332 | 38/200 | 117.1s | $10.87 |
| 8 | GPT-5.4 | OpenAI | 18.5% | 0.407 | 37/200 | 7.3s | $4.81 |
| 9 | Qwen 3.7 Plus | Qwen | 16.5% | 0.292 | 33/200 | 260.2s | $6.59 |
| 10 | Claude Opus 4.6 | Anthropic | 16.0% | 0.404 | 32/200 | 19.6s | $10.17 |
| 11 | DeepSeek V4 Pro | DeepSeek | 15.5% | 0.250 | 31/200 | 78.8s | $12.44 |
| 12 | GPT-5.2 | OpenAI | 15.0% | 0.322 | 30/200 | 75.2s | $11.30 |
| 13 | GLM 5.1 | Z.AI | 12.5% | 0.165 | 25/200 | 123.9s | $11.24 |
| 14 | Grok 4.3 | xAI | 12.0% | 0.257 | 24/200 | 7.1s | $4.00 |
| 15 | GPT-5.4 Mini | OpenAI | 12.0% | 0.228 | 24/200 | 2.9s | $4.12 |
| 16 | Nemotron 3 Ultra 550B | NVIDIA | 12.0% | 0.206 | 24/200 | 19.9s | $5.08 |
| 17 | Kimi K2.6 | Moonshot AI | 11.0% | 0.120 | 22/200 | 255.4s | $10.20 |
| 18 | Claude Sonnet 4.6 | Anthropic | 10.5% | 0.322 | 21/200 | 16.0s | $6.87 |
| 19 | DeepSeek V4 Flash | DeepSeek | 10.0% | 0.226 | 20/200 | 26.1s | $3.71 |
| 20 | Qwen 3.5 35B A3B | Qwen | 10.0% | 0.168 | 20/200 | 11.0s | $9.42 |
| 21 | Qwen 3.5 27B | Qwen | 9.5% | 0.178 | 19/200 | 125.4s | $12.13 |
| 22 | MiniMax M3 | MiniMax | 9.5% | 0.165 | 19/200 | 58.3s | $2.91 |
| 23 | GLM 5.2 | Z.AI | 9.0% | 0.174 | 18/200 | 21.3s | $9.98 |
| 24 | MiniMax M2.7 | MiniMax | 8.0% | 0.132 | 16/200 | 48.4s | $9.25 |
| 25 | Gemini 3 Flash | Google | 7.5% | 0.318 | 15/200 | 5.1s | $3.13 |
| 26 | GLM 5 | Z.AI | 7.0% | 0.249 | 14/200 | 91.9s | $4.26 |
| 27 | Mistral Medium 3.5 | Mistral AI | 7.0% | 0.161 | 14/200 | 12.0s | $4.64 |
| 28 | Kimi K2.7 Code | Moonshot AI | 6.0% | 0.067 | 12/200 | 172.5s | $8.33 |
| 29 | Grok 4.1 Fast | xAI | 5.5% | 0.273 | 11/200 | 61.1s | $3.46 |
| 30 | GPT-5 Mini | OpenAI | 5.0% | 0.275 | 10/200 | 25.4s | $3.63 |
| 31 | DeepSeek V3.2 | DeepSeek | 4.5% | 0.253 | 9/200 | 78.7s | $3.25 |
| 32 | Gemini 3.5 Flash | Google | 4.5% | 0.047 | 9/200 | 23.6s | $10.33 |
| 33 | Claude Haiku 4.5 | Anthropic | 3.5% | 0.263 | 7/200 | 7.0s | $3.95 |
| 34 | Gemini 3.1 Pro | Google | 2.5% | 0.093 | 5/200 | 44.4s | $9.60 |
| 35 | MiniMax M2.5 | MiniMax | 1.5% | 0.181 | 3/200 | 45.0s | $3.25 |
| 36 | Step 3.7 Flash | StepFun | 1.5% | 0.018 | 3/200 | 44.9s | $1.12 |
| 37 | Step 3.5 Flash | StepFun | 0.0% | 0.000 | 0/200 | 44.0s | $0.00 |
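
The headline figures (leader pass rate, gap to second place, median model) can be rechecked from the Passed column alone. The following is a minimal sketch, assuming the counts below are transcribed correctly from the table; the names `PASSED` and `pass_rate` are illustrative and are not taken from the benchmark's report scripts.

```python
from statistics import median

TOTAL_TASKS = 200

# Passed counts copied from the leaderboard table, rank 1 through rank 37.
PASSED = [
    75, 66, 59, 57, 45, 39, 38, 37, 33, 32,
    31, 30, 25, 24, 24, 24, 22, 21, 20, 20,
    19, 19, 18, 16, 15, 14, 14, 12, 11, 10,
    9, 9, 7, 5, 3, 3, 0,
]


def pass_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Return the share of tasks with an accepted patch, as a percentage."""
    return 100.0 * passed / total


leader_rate = pass_rate(PASSED[0])
runner_up_rate = pass_rate(PASSED[1])
median_passed = median(PASSED)

print(f"leader:    {leader_rate:.1f}%")
print(f"runner-up: {runner_up_rate:.1f}%")
print(f"gap:       {leader_rate - runner_up_rate:.1f} points")
print(f"median:    {median_passed}/{TOTAL_TASKS} ({pass_rate(median_passed):.1f}%)")
```

With 37 models the median is the 19th-ranked entry, so no averaging of two middle values is involved.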
Each model receives a real vulnerability description and relevant source context, then must produce a focused patch. The benchmark measures whether that patch addresses the root cause without unsafe side effects.
All models run against the same 200 vulnerability repair tasks drawn from real repositories and disclosed CVEs.
Prompts include vulnerable source snippets and vulnerability-derived localization hints, but not the reference fix.
Each model gets three independent attempts per task; the leaderboard reflects the strongest observed run for that model.
Candidate patches are reviewed against the intended security fix for root cause coverage, scope, and safety. Two judges vote on each patch; when they split, the patch still passes, so acceptance by either judge is enough.
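
The attempt and judging rules can be sketched in a few lines. This is an illustration under stated assumptions, not the benchmark's actual code: the `Verdict` type and the function names are hypothetical, and "strongest observed run" is read here as the run with the most accepted patches, which the text does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Hypothetical record of the two judges' decisions for one patch."""
    judge_a: bool
    judge_b: bool


def patch_passes(verdict: Verdict) -> bool:
    # Acceptance by either judge is enough, so a split vote still passes.
    return verdict.judge_a or verdict.judge_b


def passed_count(run: list[Verdict]) -> int:
    # Number of tasks in one run whose patch was accepted.
    return sum(patch_passes(v) for v in run)


def best_run(runs: list[list[Verdict]]) -> list[Verdict]:
    # Assumption: the strongest run is the one with the most accepted patches.
    return max(runs, key=passed_count)


# Toy data: three runs over the same three tasks.
runs = [
    [Verdict(True, True), Verdict(False, False), Verdict(False, False)],
    [Verdict(True, False), Verdict(False, True), Verdict(False, False)],
    [Verdict(False, False), Verdict(False, False), Verdict(False, True)],
]

reported = best_run(runs)
print(f"reported: {passed_count(reported)}/{len(reported)}")
```

In the toy data the second run is reported, because both of its split votes count as passes.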
The benchmark uses 200 curated CVEs from 200 repositories, balanced across three difficulty tiers and spanning 55 CWE types.
The repository includes the curated CVE set, model outputs, judge decisions, and report generation scripts used to produce this leaderboard.
| Field | Value |
|---|---|
| Command | ./run_curated_200_best3.sh |
| Subset | VulnBench-200 |
| Runs | best-of-3 |
| Judge | see row metadata |
| Hints | description-only |
| Leader | Claude Fable 5 |
| Pass Rate | 37.5% |
| Passed | 75/200 |