VulnBench Results

Public leaderboard for the curated VulnBench-200 subset. The full VulnBench dataset contains 1,650 CVEs.

16 Models Evaluated · 200 Curated Instances · 55 CWE Types · 7 Ecosystems · $49.34 Total Cost

Best Curated-200 Pass Rate: 55.5% (GPT-5.3 Codex)
Median Model Pass Rate: ~24%
Cheapest per Instance: $0.001 (DeepSeek V3.2)
Best Tier 1 Score: 61.2% (GPT-5.3 Codex)
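
The headline figures above are derived directly from the tables below. As a quick sanity check, here is a minimal Python sketch that recomputes the median and best pass rate from the Detailed Results table:

```python
import statistics

# Curated-200 pass rates for all 16 models (from the Detailed Results table below).
pass_rates = [55.5, 45.5, 42.5, 42.0, 42.0, 33.5, 29.0, 25.0,
              23.5, 19.5, 17.5, 15.0, 14.5, 11.0, 6.5, 0.0]

print(statistics.median(pass_rates))  # 24.25 -> "~24%" median model pass rate
print(max(pass_rates))                # 55.5  -> best Curated-200 pass rate
```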

Leaderboard

GPT-5.3 Codex: 55.5%
Claude Opus 4.6: 45.5%
GPT-5.4: 42.5%
Claude Sonnet 4.6: 42.0%
GPT-5.2: 42.0%
Gemini 3 Flash: 33.5%
Kimi K2.5: 29.0%
Claude Haiku 4.5: 25.0%
DeepSeek V3.2: 23.5%
GLM-5: 19.5%
GPT-5 Mini: 17.5%
Qwen 3.5-27B: 15.0%
MiniMax M2.5: 14.5%
Qwen 3.5-35B-A3B: 11.0%
Gemini 3.1 Pro: 6.5%
Step 3.5 Flash: 0.0%

Detailed Results

# Model Vendor Pass Rate Passed Mean Score Tier 1 Tier 2 Tier 3 Cost $/Instance
1 GPT-5.3 Codex OpenAI 55.5% 111/200 0.504 61.2% 53.7% 51.5% $6.23 $0.031
2 Claude Opus 4.6 Anthropic 45.5% 91/200 0.457 50.7% 43.3% 42.4% $6.46 $0.032
3 GPT-5.4 OpenAI 42.5% 85/200 0.431 46.3% 32.8% 48.5% $1.94 $0.010
4 Claude Sonnet 4.6 Anthropic 42.0% 84/200 0.419 41.8% 40.3% 43.9% $3.65 $0.018
5 GPT-5.2 OpenAI 42.0% 84/200 0.374 47.8% 41.8% 36.4% $8.50 $0.042
6 Gemini 3 Flash Google 33.5% 67/200 0.373 35.8% 26.9% 37.9% $0.34 $0.002
7 Kimi K2.5 Moonshot 29.0% 58/200 0.301 35.8% 25.4% 25.8% $1.98 $0.010
8 Claude Haiku 4.5 Anthropic 25.0% 50/200 0.317 23.9% 26.9% 24.2% $1.06 $0.005
9 DeepSeek V3.2 DeepSeek 23.5% 47/200 0.314 25.4% 22.4% 22.7% $0.17 $0.001
10 GLM-5 Zhipu 19.5% 39/200 0.268 17.9% 19.4% 21.2% $2.32 $0.012
11 GPT-5 Mini OpenAI 17.5% 35/200 0.307 19.4% 14.9% 18.2% $0.71 $0.004
12 Qwen 3.5-27B Qwen 15.0% 30/200 0.259 14.9% 17.9% 12.1% $4.53 $0.023
13 MiniMax M2.5 MiniMax 14.5% 29/200 0.248 14.9% 11.9% 16.7% $0.47 $0.002
14 Qwen 3.5-35B-A3B Qwen 11.0% 22/200 0.236 14.9% 9.0% 9.1% $1.01 $0.005
15 Gemini 3.1 Pro Google 6.5% 13/200 0.094 7.5% 6.0% 6.1% $9.96 $0.050
16 Step 3.5 Flash StepFun 0.0% 0/200 0.000 0.0% 0.0% 0.0% $0.00 $0.000

All 16 models were evaluated on the same 200 CVE instances. Judge: Claude Opus 4.6. Total evaluation cost: $49.34.

Performance by Difficulty Tier

Tier 1 — Pattern Matching

XSS, SQL injection, path traversal — well-known fix patterns (67 instances)
GPT-5.3 Codex: 61.2%
Claude Opus 4.6: 50.7%
GPT-5.2: 47.8%
GPT-5.4: 46.3%
Claude Sonnet 4.6: 41.8%
Gemini 3 Flash: 35.8%

Tier 2 — Logic Fixes

Authorization, CSRF, info disclosure — requires understanding app logic (67 instances)
GPT-5.3 Codex: 53.7%
Claude Opus 4.6: 43.3%
GPT-5.2: 41.8%
Claude Sonnet 4.6: 40.3%
GPT-5.4: 32.8%
Gemini 3 Flash: 26.9%

Tier 3 — Deep Reasoning

Code injection, resource exhaustion, input validation — requires deep analysis (66 instances)
GPT-5.3 Codex: 51.5%
GPT-5.4: 48.5%
Claude Sonnet 4.6: 43.9%
Claude Opus 4.6: 42.4%
Gemini 3 Flash: 37.9%
GPT-5.2: 36.4%
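
The overall pass rates are consistent with these tier breakdowns: the three tiers hold 67, 67, and 66 instances, so the tier-weighted passes sum to each model's overall passed count. A quick arithmetic check for the top model:

```python
# Tier sizes and GPT-5.3 Codex per-tier pass rates from the sections above.
tier_sizes = (67, 67, 66)
tier_rates = (0.612, 0.537, 0.515)

passes = [round(r * n) for r, n in zip(tier_rates, tier_sizes)]  # [41, 36, 34]
print(sum(passes), sum(passes) / sum(tier_sizes))                # 111 passes -> 0.555 overall
```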

Performance by Ecosystem (Top 8 Models)

Model npm pip Maven RubyGems Composer Rust Swift
GPT-5.3 Codex 56.0% 55.6% 40.0% 66.7% 100% 0% 0%
Claude Opus 4.6 44.8% 42.6% 60.0% 66.7% 100% 100% 0%
GPT-5.4 41.0% 46.3% 20.0% 66.7% 50.0% 100% 0%
Claude Sonnet 4.6 42.5% 40.7% 20.0% 66.7% 50.0% 100% 0%
GPT-5.2 38.8% 50.0% 40.0% 66.7% 50.0% 0% 0%
Gemini 3 Flash 36.6% 27.8% 0% 66.7% 0% 100% 0%
Kimi K2.5 32.1% 27.8% 0% 0% 0% 0% 0%
Claude Haiku 4.5 27.6% 16.7% 40.0% 66.7% 0% 0% 0%

Note: Rust (1 instance) and Swift (1 instance) have very small sample sizes. Composer has 2 instances, RubyGems 3, Maven 5.

Cost Efficiency

Model Pass Rate Total Cost Cost / Instance Cost / Pass Avg Gen Time
DeepSeek V3.2 23.5% $0.17 $0.001 $0.004 36.1s
Gemini 3 Flash 33.5% $0.34 $0.002 $0.005 4.2s
MiniMax M2.5 14.5% $0.47 $0.002 $0.016 49.0s
GPT-5.4 42.5% $1.94 $0.010 $0.023 7.9s
Claude Sonnet 4.6 42.0% $3.65 $0.018 $0.043 13.7s
GPT-5.3 Codex 55.5% $6.23 $0.031 $0.056 40.5s
Claude Opus 4.6 45.5% $6.46 $0.032 $0.071 18.8s
Gemini 3.1 Pro 6.5% $9.96 $0.050 $0.766 60.0s
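
Cost / Instance and Cost / Pass are simple ratios of the totals above: total evaluation cost divided by the 200 instances, and by the number of passed instances, respectively. For example:

```python
# (total evaluation cost in USD, passed instances) taken from the tables above
rows = {
    "DeepSeek V3.2": (0.17, 47),
    "GPT-5.3 Codex": (6.23, 111),
}
for model, (cost, passed) in rows.items():
    print(f"{model}: ${cost / 200:.3f}/instance, ${cost / passed:.3f}/pass")
# DeepSeek V3.2: $0.001/instance, $0.004/pass
# GPT-5.3 Codex: $0.031/instance, $0.056/pass
```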

Methodology

Dataset

200 real CVE instances from the GitHub Advisory Database, covering 200 unique repositories across 7 ecosystems (npm, pip, Maven, RubyGems, Composer, Rust, Swift) and 55 CWE types, balanced across three difficulty tiers (67 / 67 / 66 instances). Mean patch size: 36 lines across 1.9 files.
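
The exact instance schema is not published here; purely as an illustration, a record covering the fields mentioned above might look like the following (all field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class VulnBenchInstance:
    """Illustrative shape of one curated instance; field names are hypothetical."""
    cve_id: str              # advisory identifier from the GitHub Advisory Database
    cwe_id: str              # one of the 55 CWE types, e.g. "CWE-79"
    ecosystem: str           # "npm", "pip", "Maven", "RubyGems", "Composer", "Rust", or "Swift"
    repo: str                # affected repository (each instance uses a distinct repo)
    tier: int                # difficulty tier, 1-3
    advisory_text: str       # CVE description shown to the model
    ground_truth_patch: str  # reference fix the judge compares against
```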

Prompt

Each model receives the CVE description, CWE-specific guidance, and source-code context gathered at runtime when localization succeeds. Gold affected-file hints are supported only as a separate ablation mode and are not used for the leaderboard results.
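
A rough sketch of how such a prompt could be assembled, reusing the hypothetical instance fields from the Dataset section (these names are illustrative, not the benchmark's actual code):

```python
def build_prompt(instance, cwe_guidance: str, source_context: str | None = None,
                 gold_files: list[str] | None = None) -> str:
    """Assemble the repair prompt; gold_files is only set in the separate ablation mode."""
    parts = [
        "Fix the vulnerability described below and return a patch.",
        f"CVE description:\n{instance.advisory_text}",
        f"Guidance for {instance.cwe_id}:\n{cwe_guidance}",
    ]
    if source_context:          # present only when runtime localization succeeded
        parts.append(f"Relevant source context:\n{source_context}")
    if gold_files is not None:  # ablation mode only; not used for leaderboard runs
        parts.append("Affected files: " + ", ".join(gold_files))
    return "\n\n".join(parts)
```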

Scoring (LLM-as-Judge)

Claude Opus 4.6 compares each candidate patch against the ground-truth fix and assigns a score from 0.0 to 1.0 based on whether the root cause is addressed, whether new vulnerabilities are introduced, and how much of the affected scope is covered. An instance counts as passed when the score is ≥ 0.5 or the judge's verdict is "pass".
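
In code terms, the pass criterion described above reduces to a single predicate (a sketch; the judge's actual output format is not specified here):

```python
def is_pass(score: float, verdict: str) -> bool:
    # An instance passes if the judge's score is at least 0.5
    # or the judge's categorical verdict is "pass".
    return score >= 0.5 or verdict.strip().lower() == "pass"

# Pass rate over the curated subset is then: passed_instances / 200.
```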