GHOST
Blind vulnerability find and fix benchmark/Ghost Security

VulnBench V2 Benchmark

VulnBench evaluates AI coding models on real CVEs. Models receive the vulnerability description and source context, must find and patch the flaw without seeing the reference fix, and are scored on whether the patch fixes the root cause safely.

Models Final: 37
Curated CVEs: 200
Runs Per Model: 3x
CWE Types: 55
Ecosystems: 7

Key Findings

What We Learned

The top model solved 75/200 real vulnerability repair tasks, while the median model solved 20/200. The results show clear progress from the strongest systems, but blind CVE repair remains difficult across the field.

Best Pass Rate: 37.5%
Claude Fable 5 leads the leaderboard with 75/200 accepted patches and a 0.443 mean judge score.

Runner Up: 66/200
GPT-5.5 ranks second at 33.0%, 4.5 points behind the leader.

Median Pass Rate: 10.0%
The median model still fixes only a small share of real CVEs, even with source context and multiple attempts.

Top Mean Score: 0.468
GPT-5.3 Codex has the highest mean judge score despite ranking fifth by pass rate, reflecting stronger partial-credit patch quality.

Models Tested: 37
The leaderboard compares frontier, coding-specialized, and open-weight model families on the same vulnerability repair workload.

Remaining Gap: 125/200
Even the leading model leaves most tasks without an accepted fix, underscoring how hard blind security patching remains.
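
The headline figures above follow from simple arithmetic over the 200-task set. The sketch below reproduces them from the passed counts quoted in this section; the function and variable names are illustrative and are not taken from the benchmark's own scripts.

```python
TOTAL_TASKS = 200  # size of the curated CVE set


def pass_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Return the pass rate as a percentage of all tasks."""
    return 100.0 * passed / total


leader = pass_rate(75)     # Claude Fable 5
runner_up = pass_rate(66)  # GPT-5.5
median = pass_rate(20)     # median model

gap_points = leader - runner_up       # percentage-point gap between ranks 1 and 2
remaining_tasks = TOTAL_TASKS - 75    # tasks the leader did not solve

print(f"leader {leader:.1f}%, runner-up {runner_up:.1f}%, gap {gap_points:.1f} points")
print(f"median {median:.1f}%, leader's unsolved tasks: {remaining_tasks}/{TOTAL_TASKS}")
```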

Results

VulnBench-200 Model Leaderboard

Pass rates show how often each model produced a patch judged to fix the vulnerable behavior without seeing the reference fix.

| # | Model | Provider | Pass Rate | Score | Passed | Avg Time | Cost |
|---|-------|----------|-----------|-------|--------|----------|------|
| 1 | Claude Fable 5 | Anthropic | 37.5% | 0.443 | 75/200 | 40.5s | $34.80 |
| 2 | GPT-5.5 | OpenAI | 33.0% | 0.338 | 66/200 | 177.6s | $53.83 |
| 3 | Claude Opus 4.8 | Anthropic | 29.5% | 0.418 | 59/200 | 20.3s | $12.34 |
| 4 | Qwen 3.7 Max | Qwen | 28.5% | 0.403 | 57/200 | 278.9s | $13.78 |
| 5 | GPT-5.3 Codex | OpenAI | 22.5% | 0.468 | 45/200 | 43.4s | $8.74 |
| 6 | Kimi K2.5 | Moonshot AI | 19.5% | 0.301 | 39/200 | 123.8s | $12.78 |
| 7 | Grok Build 0.1 | xAI | 19.0% | 0.332 | 38/200 | 117.1s | $10.87 |
| 8 | GPT-5.4 | OpenAI | 18.5% | 0.407 | 37/200 | 7.3s | $4.81 |
| 9 | Qwen 3.7 Plus | Qwen | 16.5% | 0.292 | 33/200 | 260.2s | $6.59 |
| 10 | Claude Opus 4.6 | Anthropic | 16.0% | 0.404 | 32/200 | 19.6s | $10.17 |
| 11 | DeepSeek V4 Pro | DeepSeek | 15.5% | 0.250 | 31/200 | 78.8s | $12.44 |
| 12 | GPT-5.2 | OpenAI | 15.0% | 0.322 | 30/200 | 75.2s | $11.30 |
| 13 | GLM 5.1 | Z.AI | 12.5% | 0.165 | 25/200 | 123.9s | $11.24 |
| 14 | Grok 4.3 | xAI | 12.0% | 0.257 | 24/200 | 7.1s | $4.00 |
| 15 | GPT-5.4 Mini | OpenAI | 12.0% | 0.228 | 24/200 | 2.9s | $4.12 |
| 16 | Nemotron 3 Ultra 550B | NVIDIA | 12.0% | 0.206 | 24/200 | 19.9s | $5.08 |
| 17 | Kimi K2.6 | Moonshot AI | 11.0% | 0.120 | 22/200 | 255.4s | $10.20 |
| 18 | Claude Sonnet 4.6 | Anthropic | 10.5% | 0.322 | 21/200 | 16.0s | $6.87 |
| 19 | DeepSeek V4 Flash | DeepSeek | 10.0% | 0.226 | 20/200 | 26.1s | $3.71 |
| 20 | Qwen 3.5 35B A3B | Qwen | 10.0% | 0.168 | 20/200 | 11.0s | $9.42 |
| 21 | Qwen 3.5 27B | Qwen | 9.5% | 0.178 | 19/200 | 125.4s | $12.13 |
| 22 | MiniMax M3 | MiniMax | 9.5% | 0.165 | 19/200 | 58.3s | $2.91 |
| 23 | GLM 5.2 | Z.AI | 9.0% | 0.174 | 18/200 | 21.3s | $9.98 |
| 24 | MiniMax M2.7 | MiniMax | 8.0% | 0.132 | 16/200 | 48.4s | $9.25 |
| 25 | Gemini 3 Flash | Google | 7.5% | 0.318 | 15/200 | 5.1s | $3.13 |
| 26 | GLM 5 | Z.AI | 7.0% | 0.249 | 14/200 | 91.9s | $4.26 |
| 27 | Mistral Medium 3.5 | Mistral AI | 7.0% | 0.161 | 14/200 | 12.0s | $4.64 |
| 28 | Kimi K2.7 Code | Moonshot AI | 6.0% | 0.067 | 12/200 | 172.5s | $8.33 |
| 29 | Grok 4.1 Fast | xAI | 5.5% | 0.273 | 11/200 | 61.1s | $3.46 |
| 30 | GPT-5 Mini | OpenAI | 5.0% | 0.275 | 10/200 | 25.4s | $3.63 |
| 31 | DeepSeek V3.2 | DeepSeek | 4.5% | 0.253 | 9/200 | 78.7s | $3.25 |
| 32 | Gemini 3.5 Flash | Google | 4.5% | 0.047 | 9/200 | 23.6s | $10.33 |
| 33 | Claude Haiku 4.5 | Anthropic | 3.5% | 0.263 | 7/200 | 7.0s | $3.95 |
| 34 | Gemini 3.1 Pro | Google | 2.5% | 0.093 | 5/200 | 44.4s | $9.60 |
| 35 | MiniMax M2.5 | MiniMax | 1.5% | 0.181 | 3/200 | 45.0s | $3.25 |
| 36 | Step 3.7 Flash | StepFun | 1.5% | 0.018 | 3/200 | 44.9s | $1.12 |
| 37 | Step 3.5 Flash | StepFun | 0.0% | 0.000 | 0/200 | 44.0s | $0.00 |
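
Two derived views of the leaderboard can be useful: the median passed count across all 37 models, and the cost of each accepted patch. The sketch below computes both from values copied out of the leaderboard. It assumes the Cost column is the total spend for a model's evaluation, which the page does not state explicitly, and `cost_per_accepted_patch` is an illustrative helper rather than a metric the benchmark reports.

```python
import statistics

# Accepted patches per model, copied from the Passed column in rank order.
PASSED = [75, 66, 59, 57, 45, 39, 38, 37, 33, 32, 31, 30, 25, 24, 24, 24, 22, 21, 20,
          20, 19, 19, 18, 16, 15, 14, 14, 12, 11, 10, 9, 9, 7, 5, 3, 3, 0]

# (model, passed, cost in USD) for the top three rows of the leaderboard.
TOP_ROWS = [
    ("Claude Fable 5", 75, 34.80),
    ("GPT-5.5", 66, 53.83),
    ("Claude Opus 4.8", 59, 12.34),
]


def cost_per_accepted_patch(cost_usd, passed):
    """USD per accepted patch, assuming cost_usd is the model's total spend.

    Returns None when no patch was accepted, to avoid dividing by zero.
    """
    if passed == 0:
        return None
    return cost_usd / passed


median_passed = statistics.median(PASSED)
print(f"models: {len(PASSED)}, median passed: {median_passed}/200")
for model, passed, cost in TOP_ROWS:
    print(f"{model}: ${cost_per_accepted_patch(cost, passed):.2f} per accepted patch")
```

With 37 models the median is the 19th-ranked entry, which matches the 20/200 median quoted in the key findings.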
Methodology

Blind Vulnerability Repair Evaluation

Each model receives a real vulnerability description and relevant source context, then must produce a focused patch. The benchmark measures whether that patch addresses the root cause without unsafe side effects.

1
Real CVEs

All models run against the same 200 vulnerability repair tasks drawn from real repositories and disclosed CVEs.

2
Source + Hints

Prompts include vulnerable source snippets and vulnerability-derived localization hints, but not the reference fix.

3
Three Attempts

Each model gets three independent attempts per task; the leaderboard reflects the strongest observed run for that model.

4
Patch Judging

Candidate patches are reviewed against the intended security fix for root cause coverage, scope, and safety. A split two-judge vote passes when either judge accepts the patch.
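
Steps 3 and 4 can be sketched as a small aggregation routine. This is a minimal illustration, not the benchmark's actual code: the `JudgeVotes` type and function names are hypothetical, and it assumes "strongest observed run" means the single run with the most accepted patches. The page would also be consistent with other readings, such as taking the best attempt per task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVotes:
    """Verdicts from the two judges for one candidate patch (hypothetical type)."""
    judge_a: bool
    judge_b: bool


def patch_passes(votes: JudgeVotes) -> bool:
    # A split vote still passes: either judge accepting is enough.
    return votes.judge_a or votes.judge_b


def passed_in_run(run: list[JudgeVotes]) -> int:
    """Count accepted patches in one run, where a run holds one verdict pair per task."""
    return sum(patch_passes(votes) for votes in run)


def strongest_run(runs: list[list[JudgeVotes]]) -> int:
    """Index of the run with the most accepted patches (assumed meaning of 'strongest')."""
    return max(range(len(runs)), key=lambda i: passed_in_run(runs[i]))


# Toy data: three runs over three tasks. These verdicts are invented for illustration.
toy_runs = [
    [JudgeVotes(True, False), JudgeVotes(False, False), JudgeVotes(False, False)],
    [JudgeVotes(True, True), JudgeVotes(False, True), JudgeVotes(False, False)],
    [JudgeVotes(False, False), JudgeVotes(False, False), JudgeVotes(True, False)],
]

best = strongest_run(toy_runs)
print(f"strongest run: {best}, passed {passed_in_run(toy_runs[best])}/{len(toy_runs[best])}")
```

Because either judge alone can accept a patch, this voting rule is permissive: a stricter rule requiring both judges to agree would yield pass counts no higher than those reported.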

Dataset

Benchmark Dataset

The benchmark uses 200 curated CVEs from 200 repositories, balanced across three difficulty tiers and spanning 55 CWE types.

CVE Tasks: 200
Repositories: 200
CWE Types: 55
Ecosystems: 7
Mean Lines: 36
Mean Files: 1.9

Reproduce The Benchmark

The repository includes the curated CVE set, model outputs, judge decisions, and report generation scripts used to produce this leaderboard.

$ ./run_curated_200_best3.sh

  Subset     VulnBench-200
  Runs       best-of-3
  Judge      see row metadata
  Hints      description-only

  Leader     Claude Fable 5
  Pass Rate  37.5%
  Passed     75/200