RSA Conference 2026 / Ghost Security

Can LLMs Fix Real-World Security Vulnerabilities?

We tested 16 frontier language models on 1,650 real CVEs and a curated 200-instance subset. On VulnBench-200 (best-of-3), the best model fixes 22.5%. On the full 1,650-instance benchmark, the best model fixes 16.6%. The median model: under 4%.

Key Findings

What We Learned

On the curated 200-instance subset (best-of-3), even the best model patches fewer than 1 in 4 vulnerabilities. On the full 1,650-CVE benchmark, performance drops further — and the rankings shift significantly.

16.6%

Best on Full 1,650

Claude Opus 4.6 leads the full benchmark at 16.6% (273/1650), narrowly ahead of GPT-5.3 Codex at 16.4% (271/1650) — a ranking reversal from the curated subset.

22.5%

Best on Curated 200

GPT-5.3 Codex leads VulnBench-200 (best-of-3) at 22.5%, followed by GPT-5.4 at 18.5% and Claude Opus 4.6 at 16.0%.

3.4%

Full Benchmark Median

The median model fixes only 3.4% of the full 1,650-instance benchmark — roughly 1 in 30 real vulnerabilities.

0.417

Highest Full Benchmark Score

Claude Opus 4.6 achieves the highest mean judge score on the full benchmark, indicating consistently higher patch quality at scale.

Rankings Shift

Curated vs Full

GPT-5.4 drops from 2nd on the curated subset (18.5%) to a tie for 10th on the full benchmark (3.4%). Claude Sonnet 4.6 (5th to 3rd) and Gemini 3 Flash (6th to 4th) scale better, each moving up two places.

9.7%

Best Value Model

Gemini 3 Flash achieves 9.7% on the full benchmark (160/1650) at the lowest cost per instance — strong performance for a lightweight model.

VulnBench-200 Leaderboard (best-of-3)

#	Model	Provider	Pass Rate	Score	Passed	Avg Time	Cost
1	GPT-5.3 Codex	OpenAI	22.5%	0.468	45/200	43.4s	$8.74
2	GPT-5.4	OpenAI	18.5%	0.407	37/200	7.3s	$4.81
3	Claude Opus 4.6	Anthropic	16.0%	0.404	32/200	19.6s	$10.17
4	GPT-5.2	OpenAI	15.0%	0.322	30/200	75.2s	$11.30
5	Claude Sonnet 4.6	Anthropic	10.5%	0.322	21/200	16.0s	$6.87
6	Gemini 3 Flash	Google	7.5%	0.318	15/200	5.1s	$3.13
7	GLM-5	Zhipu AI	7.0%	0.249	14/200	91.9s	$4.26
8	Kimi K2.5	Moonshot AI	6.5%	0.228	13/200	65.7s	$3.47
9	Grok 4.1 Fast	xAI	5.5%	0.273	11/200	61.1s	$3.46
10	GPT-5 Mini	OpenAI	5.0%	0.275	10/200	25.4s	$3.63
11	DeepSeek V3.2	DeepSeek	4.5%	0.253	9/200	78.7s	$3.25
12	Claude Haiku 4.5	Anthropic	3.5%	0.263	7/200	7.0s	$3.95
13	Gemini 3.1 Pro	Google	2.5%	0.093	5/200	44.4s	$9.60
14	MiniMax M2.5	MiniMax	1.5%	0.181	3/200	45.0s	$3.25
14	MiniMax M2.7	MiniMax	1.5%	0.099	3/200	22.8s	$1.74
16	Step 3.5 Flash	StepFun	0.0%	0.000	0/200	44.0s	$0.00
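The VulnBench-200 figures are reported as best-of-3, but the page does not define the term. A minimal sketch of one common reading, assuming an instance counts as fixed when at least one of three independent attempts passes. The function name, instance IDs, and outcomes below are hypothetical and not taken from the VulnBench code:

```python
from typing import Dict, List


def best_of_k_pass_rate(attempts: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of instances where at least one of the first k attempts passed.

    ASSUMPTION: this "any attempt passes" rule is one common reading of
    best-of-3; the benchmark's actual aggregation rule is not stated.
    """
    if not attempts:
        return 0.0
    fixed = sum(1 for results in attempts.values() if any(results[:k]))
    return fixed / len(attempts)


# Hypothetical per-instance outcomes (made-up IDs and results).
attempts = {
    "instance-a": [False, True, False],   # fixed on the second attempt
    "instance-b": [False, False, False],  # never fixed
    "instance-c": [True, True, True],     # fixed every time
    "instance-d": [False, False, False],  # never fixed
}

rate = best_of_k_pass_rate(attempts)
print(f"{rate:.1%}")  # 50.0%
```

Under this reading, a best-of-3 pass rate is an upper bound on the single-attempt rate, which is worth keeping in mind when comparing the two leaderboards.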

VulnBench-1650 Leaderboard

#	Model	Provider	Pass Rate	Score	Passed	Avg Time
1	Claude Opus 4.6	Anthropic	16.6%	0.417	273/1650	19.3s
2	GPT-5.3 Codex	OpenAI	16.4%	0.301	271/1650	23.4s
3	Claude Sonnet 4.6	Anthropic	12.2%	0.335	201/1650	16.5s
4	Gemini 3 Flash	Google	9.7%	0.311	160/1650	8.8s
5	GPT-5.2	OpenAI	6.7%	0.129	111/1650	26.1s
6	GLM-5	Zhipu AI	5.9%	0.219	98/1650	90.6s
7	GPT-5 Mini	OpenAI	5.1%	0.246	85/1650	18.6s
8	Claude Haiku 4.5	Anthropic	4.9%	0.259	80/1650	6.5s
9	Gemini 3.1 Pro	Google	3.5%	0.097	58/1650	56.1s
10	GPT-5.4	OpenAI	3.4%	0.092	56/1650	4.5s
10	Kimi K2.5	Moonshot AI	3.4%	0.102	56/1650	25.7s
12	MiniMax M2.7	MiniMax	2.2%	0.114	36/1650	24.2s
13	MiniMax M2.5	MiniMax	2.1%	0.092	35/1650	23.7s
14	Grok 4.1 Fast	xAI	1.9%	0.108	31/1650	25.6s
15	DeepSeek V3.2	DeepSeek	1.7%	0.094	28/1650	25.2s
16	Step 3.5 Flash	StepFun	0.1%	0.001	2/1650	42.3s
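To make the ranking shifts between the two leaderboards concrete, the sketch below computes each model's rank on both and the difference. The pass rates are copied from the tables as printed; the tie handling (tied models share a rank, as in the tables) and all function and variable names are assumptions of this example:

```python
# Pass rates (%) copied from the two leaderboards; nothing is recomputed.
CURATED = {
    "GPT-5.3 Codex": 22.5, "GPT-5.4": 18.5, "Claude Opus 4.6": 16.0,
    "GPT-5.2": 15.0, "Claude Sonnet 4.6": 10.5, "Gemini 3 Flash": 7.5,
    "GLM-5": 7.0, "Kimi K2.5": 6.5, "Grok 4.1 Fast": 5.5,
    "GPT-5 Mini": 5.0, "DeepSeek V3.2": 4.5, "Claude Haiku 4.5": 3.5,
    "Gemini 3.1 Pro": 2.5, "MiniMax M2.5": 1.5, "MiniMax M2.7": 1.5,
    "Step 3.5 Flash": 0.0,
}
FULL = {
    "Claude Opus 4.6": 16.6, "GPT-5.3 Codex": 16.4, "Claude Sonnet 4.6": 12.2,
    "Gemini 3 Flash": 9.7, "GPT-5.2": 6.7, "GLM-5": 5.9,
    "GPT-5 Mini": 5.1, "Claude Haiku 4.5": 4.9, "Gemini 3.1 Pro": 3.5,
    "GPT-5.4": 3.4, "Kimi K2.5": 3.4, "MiniMax M2.7": 2.2,
    "MiniMax M2.5": 2.1, "Grok 4.1 Fast": 1.9, "DeepSeek V3.2": 1.7,
    "Step 3.5 Flash": 0.1,
}


def competition_ranks(rates):
    """Rank 1 is the highest rate; tied models share the better rank."""
    ordered = sorted(rates.values(), reverse=True)
    return {model: ordered.index(rate) + 1 for model, rate in rates.items()}


curated_rank = competition_ranks(CURATED)
full_rank = competition_ranks(FULL)

# Positive shift: the model places higher on the full benchmark.
shifts = {model: curated_rank[model] - full_rank[model] for model in CURATED}

for model, shift in sorted(shifts.items(), key=lambda kv: kv[1]):
    print(f"{model:<18} {curated_rank[model]:>2} -> {full_rank[model]:>2} ({shift:+d})")
```

With these inputs, GPT-5.4 shows the largest drop (2nd to a tie for 10th), while Claude Sonnet 4.6 and Gemini 3 Flash each move up two places.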

Difficulty Tiers

Vulnerability Difficulty Classification

VulnBench-200 instances are classified into three difficulty tiers (67 / 67 / 66 instances).

Tier 1 — Pattern

XSS (CWE-79), SQL injection (CWE-89), path traversal (CWE-22). Fixes follow well-known patterns: escape output, parameterize queries, sanitize paths.

Tier 2 — Logic

Authorization (CWE-862/863), CSRF (CWE-352), information disclosure (CWE-200). Requires understanding application logic to add missing checks.

Tier 3 — Deep

Code injection (CWE-94), resource exhaustion (CWE-400), input validation (CWE-20). Requires deep reasoning about execution flow and system behavior.
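For readers less familiar with the Tier 1 fix patterns named above (escape output, parameterize queries, sanitize paths), here is a minimal sketch using only the Python standard library. The function names and the toy schema are hypothetical; real benchmark instances span several ecosystems and are usually less tidy than this:

```python
import html
import sqlite3
from pathlib import Path


def render_comment(text: str) -> str:
    # CWE-79: escape user-controlled text before placing it in HTML.
    return f"<p>{html.escape(text)}</p>"


def find_user(conn: sqlite3.Connection, name: str):
    # CWE-89: bind the value as a parameter instead of formatting it into SQL.
    cursor = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


def resolve_upload(base_dir: Path, requested: str) -> Path:
    # CWE-22: resolve the final path and refuse anything outside base_dir.
    base = base_dir.resolve()
    target = (base / requested).resolve()
    if target != base and base not in target.parents:
        raise ValueError("path escapes base directory")
    return target


# Toy database to exercise the parameterized query (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

print(find_user(conn, "alice"))          # [(1, 'alice')]
print(find_user(conn, "x' OR '1'='1"))   # []
print(render_comment("<script>"))        # <p>&lt;script&gt;</p>
```

Tier 2 and Tier 3 fixes rarely reduce to a reusable helper like these, which is consistent with the tiers being framed around application logic and execution flow rather than patterns.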

Methodology

How VulnBench Works

Each model receives the vulnerability description, affected source code, and CWE-specific guidance. An LLM judge (Claude Opus 4.6) scores the candidate patch against the ground-truth fix commit.

Collect Real CVEs

10,000+ CVEs from the GitHub Advisory Database, enriched with CVSS scores and CWE IDs from NVD.

Build Instances

Select CVEs with fix commits, download repository snapshots, extract ground-truth diffs, scrub prompt leakage, then quality-score and tier-classify.

Prompt Models

Each model receives the CVE description, CWE guidance, the list of affected files, and the vulnerable source. It must generate a patch in unified diff format.

Judge Patches

Claude Opus 4.6 compares the candidate patch against the reference fix on three criteria: root cause addressed, no new vulnerabilities introduced, and correct scope. It assigns a score from 0.0 to 1.0.
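The prompting step asks each model for a unified diff. As an illustration of that output format only, the sketch below builds one with Python's standard `difflib` module from a toy vulnerable/fixed pair. The file path, the code snippets, and the `make_patch` helper are hypothetical and not drawn from the benchmark:

```python
import difflib

# Toy "before" and "after" sources (hypothetical, not a benchmark instance).
VULNERABLE = '''def find_user(conn, name):
    query = "SELECT id FROM users WHERE name = '%s'" % name
    return conn.execute(query).fetchall()
'''

FIXED = '''def find_user(conn, name):
    query = "SELECT id FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()
'''


def make_patch(before: str, after: str, path: str) -> str:
    """Return a unified diff turning `before` into `after` for one file."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


patch = make_patch(VULNERABLE, FIXED, "app/db.py")
print(patch)
```

The printed patch starts with `--- a/app/db.py` and `+++ b/app/db.py` headers, followed by a hunk marking removed lines with `-` and added lines with `+`. A judge comparing a candidate against the reference fix would be reading two diffs of this shape.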

Dataset

VulnBench at a Glance

1,650 real CVE instances in the full benchmark, with a curated 200-instance evaluation subset from 200 unique repositories balanced across difficulty tiers.

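Full Benchmark: 1,650
Curated Subset: 200
Mean Files: 1.9
CWE Types, Ecosystems, Mean Lines: values not captured

Severity Distribution (chart): Critical, High, Medium. Only one count is captured: 137, listed after Medium.

Top CWEs (chart): CWE-79 (XSS), CWE-22 (Path Traversal), CWE-400 (DoS), CWE-20 (Input Validation), CWE-94 (Code Injection). Counts not captured.

Ecosystems (chart): npm, pip, Maven, RubyGems, Composer, Rust, Swift. Only one count is captured: 134, listed after npm.

CVE Year Range (chart): 2026, 2025, 2024, 2023, 2022, Earlier. Counts not captured.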

Run the Benchmark Yourself

VulnBench is fully open source and reproducible. Evaluate any LLM with a single command; the sample run below uses the full 1,650-instance benchmark.

$ vulnbench run \
    --model claude-opus-4.6 \
    --subset vulnbench-full \
    --judge claude-opus-4.6

  [████████████████████░░░░] 1354/1650

  Model      Claude Opus 4.6
  Pass Rate  16.6%
  Score      0.417
  Passed     273/1650
  Time       19.3s
