RSA Conference 2026 · Ghost Security

Can LLMs Fix Real-World
Security Vulnerabilities?

We tested 16 frontier language models on the curated VulnBench-200 evaluation subset. The best model fixes 55.5% of those instances. The median: only about 24%.

16
Models Tested
200
Real CVEs
55
CWE Types
7
Ecosystems

What We Learned

Even the best models leave nearly half of real vulnerabilities unpatched. Performance varies dramatically across model families, vulnerability types, and cost tiers.

55.5%

Best Model Pass Rate

GPT-5.3 Codex leads the benchmark, successfully patching 111 of 200 real vulnerabilities, a full 10 percentage points ahead of second place.

24%

Median Model Pass Rate

The median model fixes roughly 1 in 4 vulnerabilities. There is a sharp drop-off after the top 5 models, all of which exceed 40%.

$0.001

Cheapest Per-Instance Cost

DeepSeek V3.2 achieves a 23.5% pass rate at $0.001 per instance, roughly 37x cheaper than the top model at about 40% of its accuracy.

61%

Max Tier 1 Performance

Even on the easiest vulnerabilities (XSS, SQL injection), no model exceeds 62%. Pattern-matching fixes are not yet solved.

~0%

Rust & Swift Coverage

Most models score 0% on the Rust and Swift CVEs, though the benchmark includes only one instance of each. Security patch generation remains heavily skewed toward the JavaScript and Python ecosystems.

$49

Total Eval Cost

All 16 models evaluated on all 200 instances for under $50 total via OpenRouter — making this benchmark highly reproducible.
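The headline total is easy to verify from the per-model costs in the leaderboard below:

```python
# Total per-model evaluation costs (USD), one entry per leaderboard row.
costs = [6.24, 6.46, 1.94, 3.65, 8.50, 0.34, 1.98, 1.06,
         0.17, 2.32, 0.71, 4.53, 0.47, 1.01, 9.96, 0.00]

total = round(sum(costs), 2)
print(total)  # 49.34: all 16 models on all 200 instances for under $50
```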

VulnBench-200 Leaderboard

All models evaluated on the same 200 CVE instances using Claude Opus 4.6 as judge. Temperature 0.0, max 4096 tokens.

 #  Model              Provider     Pass Rate  Score  Passed   Avg Time  Cost
 1  GPT-5.3 Codex      OpenAI       55.5%      0.514  111/200  40.5s    $6.24
 2  Claude Opus 4.6    Anthropic    45.5%      0.457   91/200  18.8s    $6.46
 3  GPT-5.4            OpenAI       42.5%      0.431   85/200   7.9s    $1.94
 4  Claude Sonnet 4.6  Anthropic    42.0%      0.419   84/200  13.7s    $3.65
 5  GPT-5.2            OpenAI       42.0%      0.374   84/200  67.6s    $8.50
 6  Gemini 3 Flash     Google       33.5%      0.373   67/200   4.2s    $0.34
 7  Kimi K2.5          Moonshot AI  29.0%      0.301   58/200  261s     $1.98
 8  Claude Haiku 4.5   Anthropic    25.0%      0.317   50/200   7.0s    $1.06
 9  DeepSeek V3.2      DeepSeek     23.5%      0.314   47/200  36.1s    $0.17
10  GLM-5              Zhipu AI     19.5%      0.268   39/200  80.1s    $2.32
11  GPT-5 Mini         OpenAI       17.5%      0.307   35/200  27.6s    $0.71
12  Qwen 3.5-27B       Alibaba      15.0%      0.259   30/200  163s     $4.53
13  MiniMax M2.5       MiniMax      14.5%      0.248   29/200  49.0s    $0.47
14  Qwen 3.5-35B-A3B   Alibaba      11.0%      0.235   22/200  28.8s    $1.01
15  Gemini 3.1 Pro     Google        6.5%      0.093   13/200  60.0s    $9.96
16  Step 3.5 Flash     StepFun       0.0%      0.000    0/200  24.4s    $0.00

Performance by Vulnerability Tier

Vulnerabilities are categorized into three difficulty tiers based on the complexity of the required fix. The benchmark is balanced: 67 / 67 / 66 instances across tiers.

Model              Tier 1 (Pattern)  Tier 2 (Logic)  Tier 3 (Deep)
GPT-5.3 Codex      61.2%             53.7%           51.5%
Claude Opus 4.6    50.7%             43.3%           42.4%
GPT-5.4            46.3%             32.8%           48.5%
Claude Sonnet 4.6  41.8%             40.3%           43.9%
GPT-5.2            47.8%             41.8%           36.4%
Gemini 3 Flash     35.8%             26.9%           37.9%
Kimi K2.5          35.8%             25.4%           25.8%
Claude Haiku 4.5   23.9%             26.9%           24.2%
DeepSeek V3.2      25.4%             22.4%           22.7%
Tier 1 — Pattern

XSS (CWE-79), SQL injection (CWE-89), path traversal (CWE-22). Fixes follow well-known patterns: escape output, parameterize queries, sanitize paths.
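A representative Tier 1 fix is mechanical: replace string interpolation with a parameterized query. A minimal sketch (function and table names are invented for illustration, not taken from the benchmark):

```python
import sqlite3

def find_user_vulnerable(conn, username):
    # CWE-89: attacker-controlled input is interpolated into the SQL string.
    cur = conn.execute(f"SELECT id FROM users WHERE name = '{username}'")
    return cur.fetchall()

def find_user_fixed(conn, username):
    # Fix: parameterized query. The driver quotes the value, so an
    # injection payload is matched literally instead of altering the query.
    cur = conn.execute("SELECT id FROM users WHERE name = ?", (username,))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

payload = "' OR '1'='1"
print(find_user_vulnerable(conn, payload))  # [(1,)]: injection dumps every row
print(find_user_fixed(conn, payload))       # []: no user is literally named that
```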

Tier 2 — Logic

Authorization (CWE-862/863), CSRF (CWE-352), information disclosure (CWE-200). Requires understanding application logic to add missing checks.
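Tier 2 fixes typically add a check the original author forgot. A hedged sketch of a CWE-862 (missing authorization) patch, with all names invented:

```python
# Hypothetical document store: doc id -> (owner, body).
DOCS = {"doc1": ("alice", "q3 budget")}

def read_doc_vulnerable(user, doc_id):
    # CWE-862: any authenticated user can read any document.
    return DOCS[doc_id][1]

def read_doc_fixed(user, doc_id):
    owner, body = DOCS[doc_id]
    # Fix: enforce ownership before returning the resource.
    if user != owner:
        raise PermissionError(f"{user} may not read {doc_id}")
    return body

print(read_doc_vulnerable("mallory", "doc1"))  # 'q3 budget': leaked to anyone
print(read_doc_fixed("alice", "doc1"))         # 'q3 budget': owner only
```

The patch is a two-line addition, but generating it requires the model to infer which field encodes ownership, which is why Tier 2 scores lag Tier 1.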

Tier 3 — Deep

Code injection (CWE-94), resource exhaustion (CWE-400), input validation (CWE-20). Requires deep reasoning about execution flow and system behavior.
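A classic CWE-94 repair replaces dynamic evaluation with a restricted parser. An illustrative sketch (not an actual benchmark instance):

```python
import ast

def parse_config_vulnerable(text):
    # CWE-94: eval() executes arbitrary expressions, e.g.
    # "__import__('os').system(...)" runs a shell command.
    return eval(text)

def parse_config_fixed(text):
    # Fix: ast.literal_eval accepts only Python literals
    # (strings, numbers, tuples, lists, dicts, sets, booleans, None).
    return ast.literal_eval(text)

print(parse_config_fixed("{'retries': 3}"))  # {'retries': 3}
try:
    parse_config_fixed("__import__('os').getcwd()")
except ValueError:
    print("rejected")  # function calls are not literals
```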

How VulnBench Works

Each model receives the vulnerability description, affected source code, and CWE-specific guidance. An LLM judge (Claude Opus 4.6) scores the candidate patch against the ground-truth fix commit.

1

Collect Real CVEs

10,000+ CVEs from the GitHub Advisory Database, enriched with CVSS scores and CWE IDs from NVD.

2

Build Instances

Select CVEs with fix commits, download repository snapshots, extract ground-truth diffs, scrub prompt leakage, then quality-score and tier-classify.
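The ground-truth diff in step 2 is a standard unified diff between the pre-fix and post-fix source; Python's difflib produces the same format:

```python
import difflib

# Before/after versions of a (hypothetical) vulnerable line.
before = "query = f\"SELECT * FROM users WHERE name = '{name}'\"\n"
after = 'query = "SELECT * FROM users WHERE name = ?"\n'

diff = "".join(difflib.unified_diff(
    before.splitlines(keepends=True),
    after.splitlines(keepends=True),
    fromfile="a/db.py",
    tofile="b/db.py",
))
print(diff)  # ---/+++ headers, an @@ hunk, then -old/+new lines
```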

3

Prompt Models

Each model receives CVE description, CWE guidance, affected files, and vulnerable source. Must generate a unified diff patch.
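The per-instance prompt can be assembled from those fields. A minimal sketch; the field names and wording are assumptions, not VulnBench's actual schema:

```python
def build_prompt(instance):
    # 'instance' keys are illustrative; the real harness may differ.
    return (
        f"CVE: {instance['cve_id']} ({instance['cwe_id']})\n"
        f"Description: {instance['description']}\n"
        f"Guidance: {instance['cwe_guidance']}\n"
        f"Affected file: {instance['path']}\n\n"
        f"Vulnerable source:\n{instance['source']}\n\n"
        "Respond with a unified diff patch that fixes the vulnerability."
    )

prompt = build_prompt({
    "cve_id": "CVE-2024-0001",
    "cwe_id": "CWE-89",
    "description": "SQL injection in user lookup.",
    "cwe_guidance": "Use parameterized queries.",
    "path": "db.py",
    "source": "query = f\"SELECT ... WHERE name = '{name}'\"",
})
print(prompt.splitlines()[0])  # CVE: CVE-2024-0001 (CWE-89)
```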

4

Judge Patches

Claude Opus 4.6 compares each candidate patch against the reference fix on three criteria: root cause addressed, no new vulnerabilities introduced, and correct scope. Scores range from 0.0 to 1.0.
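The three criteria collapse into a single 0.0 to 1.0 score; a minimal sketch, assuming equal weights and a pass threshold (both are assumptions, not documented VulnBench parameters):

```python
def judge_score(root_cause, no_new_vulns, correct_scope):
    # Each criterion is in [0, 1]; an equal-weight mean is an assumed
    # aggregation, not necessarily what the real judge prompt does.
    criteria = (root_cause, no_new_vulns, correct_scope)
    return sum(criteria) / len(criteria)

PASS_THRESHOLD = 0.8  # assumed cutoff for counting a patch as "passed"

score = judge_score(1.0, 1.0, 0.5)  # fixes root cause, scope partly off
print(round(score, 3))              # 0.833
print(score >= PASS_THRESHOLD)      # True
```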

VulnBench-200 at a Glance

200 curated instances from 200 unique GitHub repositories, balanced across difficulty tiers.

200
CVE Instances
200
Unique Repos
55
CWE Types
7
Ecosystems
36
Mean Lines Changed
1.9
Mean Files Changed

Severity Distribution

Critical: 21   High: 42   Medium: 137

Top CWEs

CWE-79 (XSS): 38   CWE-22 (Path Traversal): 25   CWE-400 (DoS): 25   CWE-20 (Input Val): 23   CWE-94 (Code Inj): 19

Ecosystems

npm: 134   pip: 54   Maven: 5   RubyGems: 3   Composer: 2   Rust: 1   Swift: 1

CVE Year Range

2026: 16   2025: 55   2024: 48   2023: 26   2022: 24   Earlier: 31

Run the Benchmark Yourself

VulnBench is fully open source and reproducible. Evaluate any LLM in under an hour for less than $10.