Public leaderboard for the curated VulnBench-200 subset. The full VulnBench dataset contains 1,650 CVEs.
| # | Model | Org | Pass Rate | Passed | Mean Score | Tier 1 | Tier 2 | Tier 3 | Total Cost | $/Instance |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5.3 Codex | OpenAI | 55.5% | 111/200 | 0.504 | 61.2% | 53.7% | 51.5% | $6.23 | $0.031 |
| 2 | Claude Opus 4.6 | Anthropic | 45.5% | 91/200 | 0.457 | 50.7% | 43.3% | 42.4% | $6.46 | $0.032 |
| 3 | GPT-5.4 | OpenAI | 42.5% | 85/200 | 0.431 | 46.3% | 32.8% | 48.5% | $1.94 | $0.010 |
| 4 | Claude Sonnet 4.6 | Anthropic | 42.0% | 84/200 | 0.419 | 41.8% | 40.3% | 43.9% | $3.65 | $0.018 |
| 5 | GPT-5.2 | OpenAI | 42.0% | 84/200 | 0.374 | 47.8% | 41.8% | 36.4% | $8.50 | $0.042 |
| 6 | Gemini 3 Flash | Google | 33.5% | 67/200 | 0.373 | 35.8% | 26.9% | 37.9% | $0.34 | $0.002 |
| 7 | Kimi K2.5 | Moonshot | 29.0% | 58/200 | 0.301 | 35.8% | 25.4% | 25.8% | $1.98 | $0.010 |
| 8 | Claude Haiku 4.5 | Anthropic | 25.0% | 50/200 | 0.317 | 23.9% | 26.9% | 24.2% | $1.06 | $0.005 |
| 9 | DeepSeek V3.2 | DeepSeek | 23.5% | 47/200 | 0.314 | 25.4% | 22.4% | 22.7% | $0.17 | $0.001 |
| 10 | GLM-5 | Zhipu | 19.5% | 39/200 | 0.268 | 17.9% | 19.4% | 21.2% | $2.32 | $0.012 |
| 11 | GPT-5 Mini | OpenAI | 17.5% | 35/200 | 0.307 | 19.4% | 14.9% | 18.2% | $0.71 | $0.004 |
| 12 | Qwen 3.5-27B | Qwen | 15.0% | 30/200 | 0.259 | 14.9% | 17.9% | 12.1% | $4.53 | $0.023 |
| 13 | MiniMax M2.5 | MiniMax | 14.5% | 29/200 | 0.248 | 14.9% | 11.9% | 16.7% | $0.47 | $0.002 |
| 14 | Qwen 3.5-35B-A3B | Qwen | 11.0% | 22/200 | 0.236 | 14.9% | 9.0% | 9.1% | $1.01 | $0.005 |
| 15 | Gemini 3.1 Pro | Google | 6.5% | 13/200 | 0.094 | 7.5% | 6.0% | 6.1% | $9.96 | $0.050 |
| 16 | Step 3.5 Flash | StepFun | 0.0% | 0/200 | 0.000 | 0.0% | 0.0% | 0.0% | $0.00 | $0.000 |
All 16 models were evaluated on the same 200 CVE instances; the Tier 1-3 columns are per-tier pass rates. Judge: Claude Opus 4.6. Total evaluation cost: $49.34.
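The aggregate columns are straightforward arithmetic over the per-instance judge results. A minimal sketch of how these columns are derived, assuming one judge record per instance (the field names are illustrative, not the harness's actual schema):

```python
from statistics import mean

def leaderboard_row(results, total_cost, n=200):
    # `results`: one dict per CVE instance, e.g. {"passed": bool, "score": float, "tier": int}
    passed = sum(r["passed"] for r in results)
    return {
        "pass_rate": passed / n,                          # e.g. 91 passes -> 45.5%
        "passed": f"{passed}/{n}",
        "mean_score": mean(r["score"] for r in results),  # judge score, 0.0-1.0
        "cost_per_instance": total_cost / n,              # e.g. $6.46 / 200 -> $0.032
        **{  # per-tier pass rates (the Tier 1-3 columns)
            f"tier_{t}": mean(r["passed"] for r in results if r["tier"] == t)
            for t in (1, 2, 3)
        },
    }
```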
Pass rate by package ecosystem (top 8 models):

| Model | npm | pip | Maven | RubyGems | Composer | Rust | Swift |
|---|---|---|---|---|---|---|---|
| GPT-5.3 Codex | 56.0% | 55.6% | 40.0% | 66.7% | 100.0% | 0.0% | 0.0% |
| Claude Opus 4.6 | 44.8% | 42.6% | 60.0% | 66.7% | 100.0% | 100.0% | 0.0% |
| GPT-5.4 | 41.0% | 46.3% | 20.0% | 66.7% | 50.0% | 100.0% | 0.0% |
| Claude Sonnet 4.6 | 42.5% | 40.7% | 20.0% | 66.7% | 50.0% | 100.0% | 0.0% |
| GPT-5.2 | 38.8% | 50.0% | 40.0% | 66.7% | 50.0% | 0.0% | 0.0% |
| Gemini 3 Flash | 36.6% | 27.8% | 0.0% | 66.7% | 0.0% | 100.0% | 0.0% |
| Kimi K2.5 | 32.1% | 27.8% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
| Claude Haiku 4.5 | 27.6% | 16.7% | 40.0% | 66.7% | 0.0% | 0.0% | 0.0% |

Note: sample sizes outside npm and pip are very small (Maven has 5 instances, RubyGems 3, Composer 2, and Rust and Swift 1 each), so those columns should be read with caution.

Cost efficiency for selected models, sorted by total cost:
| Model | Pass Rate | Total Cost | Cost / Instance | Cost / Pass | Avg Gen Time |
|---|---|---|---|---|---|
| DeepSeek V3.2 | 23.5% | $0.17 | $0.001 | $0.004 | 36.1s |
| Gemini 3 Flash | 33.5% | $0.34 | $0.002 | $0.005 | 4.2s |
| MiniMax M2.5 | 14.5% | $0.47 | $0.002 | $0.016 | 49.0s |
| GPT-5.4 | 42.5% | $1.94 | $0.010 | $0.023 | 7.9s |
| Claude Sonnet 4.6 | 42.0% | $3.65 | $0.018 | $0.043 | 13.7s |
| GPT-5.3 Codex | 55.5% | $6.23 | $0.031 | $0.056 | 40.5s |
| Claude Opus 4.6 | 45.5% | $6.46 | $0.032 | $0.071 | 18.8s |
| Gemini 3.1 Pro | 6.5% | $9.96 | $0.050 | $0.766 | 60.0s |
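Cost / Pass divides a model's total evaluation cost by its number of passing instances, which is how Gemini 3.1 Pro ends up at $9.96 / 13 ≈ $0.766 per pass while DeepSeek V3.2 lands at $0.17 / 47 ≈ $0.004. A one-function sketch:

```python
def cost_per_pass(total_cost: float, passed: int) -> float:
    # e.g. cost_per_pass(9.96, 13) -> ~0.766; cost_per_pass(0.17, 47) -> ~0.004
    return total_cost / passed if passed else float("inf")  # no passes -> undefined
```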
The benchmark comprises 200 real CVE instances from the GitHub Advisory Database, covering 200 unique repositories across 7 ecosystems (npm, pip, Maven, RubyGems, Composer, Rust, Swift) and 55 CWE types, balanced across 3 difficulty tiers. Mean patch size: 36 lines across 1.9 files.
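A hypothetical sketch of how a tier-balanced, one-CVE-per-repo subset could be drawn from the full 1,650-CVE pool; the field names and sampling procedure here are illustrative, not the actual curation pipeline:

```python
import random

def balanced_subset(pool, k=200, seed=0):
    # `pool`: dicts like {"cve": str, "tier": int, "repo": str} (illustrative fields)
    rng = random.Random(seed)
    quotas = [k // 3 + (1 if t <= k % 3 else 0) for t in (1, 2, 3)]  # 200 -> [67, 67, 66]
    subset, used_repos = [], set()
    for tier, quota in zip((1, 2, 3), quotas):
        candidates = [c for c in pool if c["tier"] == tier]
        rng.shuffle(candidates)
        for c in candidates:
            if quota == 0:
                break
            if c["repo"] in used_repos:  # keep all repos in the subset unique
                continue
            subset.append(c)
            used_repos.add(c["repo"])
            quota -= 1
    return subset
```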
Each model receives the CVE advisory text, CWE guidance, and, when localization succeeds, runtime source context. Gold affected-file hints are supported only as a separate ablation mode.
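A hypothetical sketch of how those inputs might be assembled into a single prompt (the function and field names are invented for illustration; the harness's actual format may differ):

```python
def build_prompt(instance, localized_sources=None, gold_hints=False):
    # `instance`: dict with illustrative fields "advisory_text", "cwe_guidance", "gold_files"
    parts = [
        "## CVE Advisory\n" + instance["advisory_text"],
        "## CWE Guidance\n" + instance["cwe_guidance"],
    ]
    if localized_sources:  # runtime source context, included only when localization succeeded
        for path, src in localized_sources.items():
            parts.append(f"## Source: {path}\n{src}")
    if gold_hints:  # the separate ablation mode with gold affected-file hints
        parts.append("## Affected Files\n" + "\n".join(instance["gold_files"]))
    return "\n\n".join(parts)
```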
Claude Opus 4.6, acting as judge, compares each candidate patch against the ground-truth fix and scores it from 0.0 to 1.0 on three criteria: whether the root cause is addressed, whether any new vulnerabilities are introduced, and whether the fix covers the full scope of the vulnerability. Pass threshold: score ≥ 0.5 or an explicit judge verdict of "pass".
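The pass rule reduces to a single predicate; a minimal sketch, assuming the judge returns a numeric score plus a verdict string:

```python
def is_pass(score: float, verdict: str) -> bool:
    # Pass if the judge's score clears the 0.5 threshold OR it issues an explicit "pass".
    return score >= 0.5 or verdict == "pass"
```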