RSA Conference 2026 / Ghost Security

Can LLMs Fix Real-World
Security Vulnerabilities?

We tested 16 frontier language models on 1,650 real CVEs and a curated 200-instance subset. On VulnBench-200 (best-of-3), the best model fixes 22.5%. On the full 1,650-instance benchmark, the best model fixes 16.6%. The median model: under 4%.

16 models tested · 1,650 full-benchmark CVEs · 200-instance curated subset · 55 CWE types · 7 ecosystems
Key Findings

What We Learned

On the curated 200-instance subset (best-of-3), even the best model patches fewer than 1 in 4 vulnerabilities. On the full 1,650-CVE benchmark, performance drops further — and the rankings shift significantly.

16.6%
Best on Full 1,650

Claude Opus 4.6 leads the full benchmark at 16.6% (273/1650), narrowly ahead of GPT-5.3 Codex at 16.4% (271/1650) — a ranking reversal from the curated subset.

22.5%
Best on Curated 200

GPT-5.3 Codex leads VulnBench-200 (best-of-3) at 22.5%, followed by GPT-5.4 at 18.5% and Claude Opus 4.6 at 16.0%.

3.4%
Full Benchmark Median

The median model fixes only 3.4% of the full 1,650-instance benchmark — roughly 1 in 30 real vulnerabilities.

0.417
Highest Mean Judge Score (Full Benchmark)

Claude Opus 4.6 achieves the highest mean judge score on the full benchmark, indicating consistently higher patch quality at scale.

Rankings Shift
Curated vs Full

GPT-5.4 drops from 2nd on the curated subset (18.5%) to 10th on the full benchmark (3.4%). Claude Sonnet and Gemini Flash scale better, holding or improving their relative positions.

9.7%
Best Value Model

Gemini 3 Flash achieves 9.7% on the full benchmark (160/1650) at one of the lowest costs per instance, strong performance for a lightweight model.

Results

VulnBench-200 Leaderboard

Curated 200-instance subset with best-of-3 variance reduction and description-only file hints. All 16 models also evaluated on the full 1,650-instance benchmark (single pass). Judge: Claude Opus 4.6.

#   Model              Org          Pass Rate   Score   Passed   Avg Time   Cost
1   GPT-5.3 Codex      OpenAI       22.5%       0.468   45/200   43.4s      $8.74
2   GPT-5.4            OpenAI       18.5%       0.407   37/200   7.3s       $4.81
3   Claude Opus 4.6    Anthropic    16.0%       0.404   32/200   19.6s      $10.17
4   GPT-5.2            OpenAI       15.0%       0.322   30/200   75.2s      $11.30
5   Claude Sonnet 4.6  Anthropic    10.5%       0.322   21/200   16.0s      $6.87
6   Gemini 3 Flash     Google       7.5%        0.318   15/200   5.1s       $3.13
7   GLM-5              Zhipu AI     7.0%        0.249   14/200   91.9s      $4.26
8   Kimi K2.5          Moonshot AI  6.5%        0.228   13/200   65.7s      $3.47
9   Grok 4.1 Fast      xAI          5.5%        0.273   11/200   61.1s      $3.46
10  GPT-5 Mini         OpenAI       5.0%        0.275   10/200   25.4s      $3.63
11  DeepSeek V3.2      DeepSeek     4.5%        0.253   9/200    78.7s      $3.25
12  Claude Haiku 4.5   Anthropic    3.5%        0.263   7/200    7.0s       $3.95
13  Gemini 3.1 Pro     Google       2.5%        0.093   5/200    44.4s      $9.60
14  MiniMax M2.5       MiniMax      1.5%        0.181   3/200    45.0s      $3.25
14  MiniMax M2.7       MiniMax      1.5%        0.099   3/200    22.8s      $1.74
16  Step 3.5 Flash     StepFun      0.0%        0.000   0/200    44.0s      $0.00
Full Benchmark

VulnBench-1650 Leaderboard

All 16 models evaluated on the full 1,650-instance benchmark (single pass). Description-only file hints. Judge: Claude Opus 4.6.

#   Model              Org          Pass Rate   Score   Passed     Avg Time
1   Claude Opus 4.6    Anthropic    16.6%       0.417   273/1650   19.3s
2   GPT-5.3 Codex      OpenAI       16.4%       0.301   271/1650   23.4s
3   Claude Sonnet 4.6  Anthropic    12.2%       0.335   201/1650   16.5s
4   Gemini 3 Flash     Google       9.7%        0.311   160/1650   8.8s
5   GPT-5.2            OpenAI       6.7%        0.129   111/1650   26.1s
6   GLM-5              Zhipu AI     5.9%        0.219   98/1650    90.6s
7   GPT-5 Mini         OpenAI       5.1%        0.246   85/1650    18.6s
8   Claude Haiku 4.5   Anthropic    4.9%        0.259   80/1650    6.5s
9   Gemini 3.1 Pro     Google       3.5%        0.097   58/1650    56.1s
10  GPT-5.4            OpenAI       3.4%        0.092   56/1650    4.5s
10  Kimi K2.5          Moonshot AI  3.4%        0.102   56/1650    25.7s
12  MiniMax M2.7       MiniMax      2.2%        0.114   36/1650    24.2s
13  MiniMax M2.5       MiniMax      2.1%        0.092   35/1650    23.7s
14  Grok 4.1 Fast      xAI          1.9%        0.108   31/1650    25.6s
15  DeepSeek V3.2      DeepSeek     1.7%        0.094   28/1650    25.2s
16  Step 3.5 Flash     StepFun      0.1%        0.001   2/1650     42.3s
Difficulty Tiers

Vulnerability Difficulty Classification

VulnBench-200 instances are classified into three difficulty tiers (67 / 67 / 66 instances).

Tier 1 — Pattern

XSS (CWE-79), SQL injection (CWE-89), path traversal (CWE-22). Fixes follow well-known patterns: escape output, parameterize queries, sanitize paths.
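As a hedged illustration (not an actual benchmark instance), a Tier 1 fix is often the one-line change below: the vulnerable variant splices user input into SQL, the fix parameterizes the query. Function names are hypothetical.

import sqlite3

def find_user_vulnerable(conn: sqlite3.Connection, username: str):
    # CWE-89: attacker-controlled input is interpolated into the SQL string
    return conn.execute(f"SELECT * FROM users WHERE name = '{username}'").fetchone()

def find_user_fixed(conn: sqlite3.Connection, username: str):
    # Pattern fix: parameterized query; the driver handles quoting and escaping
    return conn.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchone()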

Tier 2 — Logic

Authorization (CWE-862/863), CSRF (CWE-352), information disclosure (CWE-200). Requires understanding application logic to add missing checks.
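A minimal, hypothetical sketch of the missing-check shape Tier 2 fixes take (here CWE-862); the toy document store and names are illustrative only, not drawn from the benchmark.

DOCUMENTS = {1: {"owner": "alice", "body": "q3 plan"}}

def delete_document_vulnerable(doc_id: int, requesting_user: str) -> None:
    # CWE-862: no ownership check, so any authenticated caller can delete any document
    DOCUMENTS.pop(doc_id, None)

def delete_document_fixed(doc_id: int, requesting_user: str) -> None:
    # Logic fix: the patch must understand who is allowed to act on this resource
    doc = DOCUMENTS.get(doc_id)
    if doc is None or doc["owner"] != requesting_user:
        raise PermissionError("caller does not own this document")
    del DOCUMENTS[doc_id]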

Tier 3 — Deep

Code injection (CWE-94), resource exhaustion (CWE-400), input validation (CWE-20). Requires deep reasoning about execution flow and system behavior.
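For Tier 3, a hypothetical CWE-94 example: the fix is not a textual pattern but a change in how untrusted input reaches an execution sink.

import ast

def load_setting_vulnerable(raw: str):
    # CWE-94: eval executes arbitrary attacker-supplied code
    return eval(raw)

def load_setting_fixed(raw: str):
    # Deep fix: restrict the sink so input can only be parsed as a literal, never executed
    return ast.literal_eval(raw)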

Methodology

How VulnBench Works

Each model receives the vulnerability description, affected source code, and CWE-specific guidance. An LLM judge (Claude Opus 4.6) scores the candidate patch against the ground-truth fix commit.

1
Collect Real CVEs

10,000+ CVEs from the GitHub Advisory Database, enriched with CVSS scores and CWE IDs from NVD.
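A hedged sketch of this collection step, assuming the public GitHub global advisories endpoint and the NVD 2.0 CVE API; it is not the project's actual pipeline code, and parameter names reflect our reading of those APIs.

import requests

def fetch_github_advisories(ecosystem: str = "npm", page: int = 1) -> list[dict]:
    # Reviewed advisories from the GitHub Advisory Database (global advisories API)
    resp = requests.get(
        "https://api.github.com/advisories",
        params={"ecosystem": ecosystem, "type": "reviewed", "per_page": 100, "page": page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def enrich_from_nvd(cve_id: str) -> dict:
    # CVSS scores and CWE IDs for one CVE from the NVD 2.0 API
    resp = requests.get(
        "https://services.nvd.nist.gov/rest/json/cves/2.0",
        params={"cveId": cve_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["vulnerabilities"][0]["cve"]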

2
Build Instances

Select CVEs with fix commits, download repository snapshots, extract ground-truth diffs, scrub prompt leakage, then quality-score and tier-classify.
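A minimal sketch of the snapshot-and-diff extraction, assuming plain git; caching, prompt-leakage scrubbing, quality scoring, and tier classification are omitted.

import subprocess

def build_instance(repo_url: str, fix_commit: str, workdir: str) -> str:
    """Clone the repo, check out the parent of the fix commit (the vulnerable
    snapshot the model is asked to patch), and return the ground-truth diff."""
    subprocess.run(["git", "clone", "--quiet", repo_url, workdir], check=True)
    subprocess.run(["git", "-C", workdir, "checkout", "--quiet", f"{fix_commit}^"], check=True)
    result = subprocess.run(
        ["git", "-C", workdir, "diff", f"{fix_commit}^", fix_commit],
        check=True, capture_output=True, text=True,
    )
    return result.stdout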

3
Prompt Models

Each model receives CVE description, CWE guidance, affected files, and vulnerable source. Must generate a unified diff patch.
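A simplified sketch of prompt assembly; the real template and the CWE-specific guidance text live in the benchmark repository, so the field names here are assumptions.

def build_prompt(cve_id: str, description: str, cwe_guidance: str,
                 files: dict[str, str]) -> str:
    # files maps affected file paths to their vulnerable source
    sources = "\n\n".join(f"--- {path} ---\n{code}" for path, code in files.items())
    return (
        f"Vulnerability: {cve_id}\n"
        f"Description: {description}\n\n"
        f"CWE guidance:\n{cwe_guidance}\n\n"
        f"Affected source:\n{sources}\n\n"
        "Fix the root cause and reply with a unified diff only."
    )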

4
Judge Patches

Claude Opus 4.6 compares candidate against reference: root cause addressed, no new vulnerabilities, correct scope. Score 0.0–1.0.
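A hedged sketch of the judging step: complete stands in for any call to the judge model (Claude Opus 4.6 in our runs), and the exact rubric wording is an assumption; the criteria mirror the ones listed above.

import re
from typing import Callable

RUBRIC = (
    "Compare the candidate patch to the reference fix. Does it address the root "
    "cause, avoid introducing new vulnerabilities, and stay in scope? "
    "Reply with a single score between 0.0 and 1.0."
)

def judge_patch(complete: Callable[[str], str],
                candidate_diff: str, reference_diff: str) -> float:
    prompt = (f"{RUBRIC}\n\nReference fix:\n{reference_diff}\n\n"
              f"Candidate patch:\n{candidate_diff}")
    reply = complete(prompt)
    match = re.search(r"\d+(?:\.\d+)?", reply)
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 1.0)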

Dataset

VulnBench at a Glance

1,650 real CVE instances in the full benchmark, with a curated 200-instance evaluation subset from 200 unique repositories balanced across difficulty tiers.

Full benchmark: 1,650 instances · Curated subset: 200 · CWE types: 55 · Ecosystems: 7 · Mean lines: 36 · Mean files: 1.9
Composition of the curated 200-instance subset (counts sum to 200):

Severity Distribution: Critical 21 · High 42 · Medium 137
Top CWEs: CWE-79 (XSS) 38 · CWE-22 (Path Traversal) 25 · CWE-400 (DoS) 25 · CWE-20 (Input Validation) 23 · CWE-94 (Code Injection) 19
Ecosystems: npm 134 · pip 54 · Maven 5 · RubyGems 3 · Composer 2 · Rust 1 · Swift 1
CVE Year Range: 2026: 16 · 2025: 55 · 2024: 48 · 2023: 26 · 2022: 24 · earlier: 31

Run the Benchmark Yourself

VulnBench is fully open source and reproducible. Evaluate any LLM on the curated 200-instance subset or the full 1,650-instance benchmark with a single command; the example below shows a full-benchmark run.

View on GitHub
$ vulnbench run \
    --model claude-opus-4.6 \
    --subset vulnbench-full \
    --judge claude-opus-4.6

  [████████████████████░░░░] 1354/1650

  Model      Claude Opus 4.6
  Pass Rate  16.6%
  Score      0.417
  Passed     273/1650
  Time       19.3s