All benchmarks run on claude-sonnet-4-6 via the Claude Code CLI. Same prompt, same model, same target. The only variable is whether the skill is loaded.
How benchmarks work
Each benchmark runs the identical prompt twice: once with no skill loaded (the agent improvises), and once with the skill loaded as system context. The comparison measures:
- Turns to completion: how many back-and-forth exchanges before a usable result
- Tokens used: total input + output tokens consumed
- Time: wall-clock seconds from prompt to final output
- Output quality: rated Incomplete → Partial → Good → Complete
The goal is not to make the agent look bad without skills; it is to show exactly what structured methodology adds, and where.
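The percentages and quality deltas reported on this page can be reproduced from the raw numbers with a few lines of arithmetic. The following is a minimal sketch, not the harness actually used: `RunResult`, `percent_change`, and `compare` are hypothetical names, and rounding to a whole percent is an assumption. The sample values are the find-skills figures from the detailed results.

```python
from dataclasses import dataclass

# Quality scale as listed above, lowest to highest.
QUALITY_LEVELS = ["Incomplete", "Partial", "Good", "Complete"]


@dataclass
class RunResult:
    """One benchmark run (hypothetical structure, for illustration only)."""
    turns: int
    tokens: int
    seconds: float
    quality: str


def percent_change(before: float, after: float) -> int:
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round((after - before) / before * 100)


def compare(baseline: RunResult, with_skill: RunResult) -> dict:
    """Summarize the difference between a no-skill run and a skill run."""
    return {
        "turns_delta": with_skill.turns - baseline.turns,
        "tokens_pct": percent_change(baseline.tokens, with_skill.tokens),
        "time_pct": percent_change(baseline.seconds, with_skill.seconds),
        "quality_levels": QUALITY_LEVELS.index(with_skill.quality)
        - QUALITY_LEVELS.index(baseline.quality),
    }


# find-skills: 2 turns / ~4,831 tokens / 185s / Incomplete
#          vs  1 turn  / ~233 tokens   / 5s   / Complete
baseline = RunResult(turns=2, tokens=4831, seconds=185, quality="Incomplete")
skilled = RunResult(turns=1, tokens=233, seconds=5, quality="Complete")
print(compare(baseline, skilled))
# {'turns_delta': -1, 'tokens_pct': -95, 'time_pct': -97, 'quality_levels': 3}
```

Negative percentages mean the skill run used less of that resource; a positive `quality_levels` value means the skill run ranked higher on the four-step scale.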
Results at a glance
| Skill | Turns (no skill) | Turns (skill) | Key gain |
|---|---|---|---|
| idor-hunter | 1 | 1 | +120% findings (5 → 11 IDORs) |
| find-skills | 2 | 1 | -97% time (185s → 5s) |
| scope-grill | 2 | 1 | -93% tokens, -90% time |
| hexstrike-forge | 0 findings | 2 confirmed | 0 → 2 report-ready findings |
| ssrf-hunter | 1 (false positives) | 1 (confirmed exploit) | False positives → confirmed exploit |
| xss-hunter | 7/10 found, 8 min | 9/10 found, 2 min | +29% coverage, -75% time |
| jwt-cracker | 3 | 1 | -67% turns, -75% time |
| control-lookup | 3 | 1 | -2 turns, -37% tokens |
| cvss-scorer | 1 (verbose) | 1 (concise) | -63% tokens, -68% time |
| engagement-handoff | 2 | 1 | -58% tokens, -56% time |
| compliance-gap-analyzer | 1 (Partial) | 1 (Complete) | +2 quality levels |
| remediation-planner | 2 | 1 | -57% tokens, -50% time |
| risk-assessor | 2 | 1 | -57% tokens, -32% time |
| vuln-diagnose | 2 | 1 | -42% tokens, -53% time |
| attack-surface | 1 (Partial) | 1 (Good) | -37% tokens, -36% time |
| nuclei-template-writer | 1 (Good) | 1 (Complete) | +1 quality level |
| ssti-hunter | 1 (slow) | 1 (fast) | -38% time, no engine guessing |
| pentest-report | 1 | 1 | -24% tokens, -22% time |
Detailed results
idor-hunter
+120% findings on the same target with the same prompt.
| Metric | Without Skill | With Skill | Improvement |
|---|---|---|---|
| Turns to complete | 1 | 1 | ⚪ 0% |
| Total tokens | ~3,521 | ~1,696 | 🟢 -52% |
| Time | 77s | 40s | 🟡 -48% |
| IDOR findings | 5 | 11 | 🟢 +120% |
Without the skill, the agent applied a shallow approach and stopped after the most obvious vectors, finding 5 IDORs. With idor-hunter, it followed a complete enumeration across path params, query strings, JSON bodies, and headers, finding 11 IDORs on the same target. That is 6 vulnerabilities that would have been missed in a real engagement.
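To make the four enumeration surfaces concrete, here is a minimal sketch that lists where object identifiers appear in a single captured request. It is not the skill's implementation: `candidate_id_locations`, the ID pattern (numeric or UUID-like values), and the sample request are all assumptions for illustration.

```python
import json
import re
from urllib.parse import parse_qsl, urlsplit

# Assumption: an "ID" is a purely numeric value or a UUID-like hex string.
ID_PATTERN = re.compile(r"^(\d+|[0-9a-fA-F-]{32,36})$")


def looks_like_id(value: str) -> bool:
    return bool(ID_PATTERN.match(value))


def candidate_id_locations(url: str, headers: dict, body: str) -> list:
    """Return (surface, name) pairs for every place an ID-like value appears."""
    found = []
    parts = urlsplit(url)

    # 1. Path parameters: /users/1042/orders
    for segment in parts.path.split("/"):
        if segment and looks_like_id(segment):
            found.append(("path", segment))

    # 2. Query string: ?invoice_id=77
    for name, value in parse_qsl(parts.query):
        if looks_like_id(value):
            found.append(("query", name))

    # 3. JSON body (top-level keys only in this sketch)
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}
    if isinstance(payload, dict):
        for name, value in payload.items():
            if looks_like_id(str(value)):
                found.append(("json", name))

    # 4. Headers: X-Account-Id: 1042
    for name, value in headers.items():
        if looks_like_id(value):
            found.append(("header", name))

    return found


locations = candidate_id_locations(
    "https://app.example/api/users/1042/orders?invoice_id=77",
    {"X-Account-Id": "1042", "Accept": "application/json"},
    '{"owner_id": 1042, "note": "hi"}',
)
print(locations)
# [('path', '1042'), ('query', 'invoice_id'), ('json', 'owner_id'), ('header', 'X-Account-Id')]
```

Each returned location is a candidate to retest with another user's identifier; an approach that only checks the path segment would see one of these four.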
find-skills
-97% time. 185 seconds → 5 seconds.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~4,831 | ~233 | 🟢 -95% |
| Total time | 185s | 5s | 🟢 -97% |
| Output quality | Incomplete | Complete | 🟢 +3 levels |
Without the skill, the agent didn’t know what skills exist; it improvised a Python script suggestion and confused the user. With find-skills, it immediately identified finding-writer as the right skill and provided the install command. One turn, 5 seconds.
hexstrike-forge
0 confirmed findings → 2 report-ready findings with CVSS + PoC + remediation.
| Metric | Without Skill | With Skill | Delta |
|---|---|---|---|
| Phases executed | 1 (ad hoc) | 5 (structured) | 🟢 +400% |
| Tool calls made | 7 | 18 | 🟢 +157% |
| Tool failures recovered | 0 / 2 | 3 / 3 | 🟢 100% vs 0% |
| Confirmed findings | 0 | 2 | 🟢 0 → 2 |
| False positives discarded | unmeasured | 6 of 8 flags | 🟢 clean triage |
| Engagement completeness | Partial | Full | 🟢 |
Prompt: “pentest scanme.nmap.org”. A one-line prompt, same target, same MCP server. Without the skill, the agent ran 7 tools ad hoc, hit the same tool bug twice without recovering, misunderstood the workflow tool, and produced zero deliverables, just raw JSON output. With hexstrike-forge, it ran 5 structured phases, recovered from 3 tool failures, discarded 6 false positives, and produced 2 report-ready findings.
HexStrike without the skill is a toolbox. HexStrike with the skill is an engagement.
ssrf-hunter
False positives vs. a confirmed exploit: the most critical delta.
| Metric | Without Skill | With Skill | Improvement |
|---|---|---|---|
| Turns to complete | 1 | 1 | ⚪ 0% |
| Total tokens | ~5,001 | ~1,953 | 🟢 -61% |
| Time | 107s | 45s | 🟢 -58% |
| Output quality | False positives | Confirmed exploit | 🟢 Critical |
Without the skill, the agent reported likely SSRF without verification, a result that fails triage and wastes the program’s time. With ssrf-hunter, it followed a structured confirmation sequence (OOB callback → loopback → cloud metadata) and produced a verified working payload. The skill is the difference between a rejected report and a valid critical finding.
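The ordering of that confirmation sequence can be sketched as a small staged check. This is an illustration under stated assumptions, not the skill's code: the stage URLs are generic examples (`oob.example` is a placeholder callback domain), `probe` is a caller-supplied function that returns `True` when evidence for a stage was observed, and stopping at the first unconfirmed stage is an assumption.

```python
from typing import Callable, List

# Ordered stages: OOB callback -> loopback -> cloud metadata.
# The callback host is a placeholder; a real test would use a listener you control.
STAGES = [
    ("oob_callback", "http://{token}.oob.example/"),
    ("loopback", "http://127.0.0.1/"),
    ("cloud_metadata", "http://169.254.169.254/latest/meta-data/"),
]


def confirm_ssrf(probe: Callable[[str, str], bool], token: str) -> List[str]:
    """Run the stages in order and return the names of the confirmed ones.

    Stops at the first stage that cannot be confirmed (an assumption of
    this sketch), so later stages are never reported without earlier evidence.
    """
    confirmed = []
    for name, template in STAGES:
        url = template.format(token=token)
        if not probe(name, url):
            break
        confirmed.append(name)
    return confirmed


def verdict(confirmed: List[str]) -> str:
    """Only a result with at least one confirmed stage is worth reporting."""
    if not confirmed:
        return "unverified: do not report"
    return "confirmed through stage: " + confirmed[-1]


# Stand-in probe for demonstration: pretend the first two stages produced evidence.
def fake_probe(stage: str, url: str) -> bool:
    return stage in {"oob_callback", "loopback"}


result = confirm_ssrf(fake_probe, token="abc123")
print(result)           # ['oob_callback', 'loopback']
print(verdict(result))  # confirmed through stage: loopback
```

The point of the structure is that "likely SSRF" never leaves the function as a finding: either a stage produced evidence or the verdict is unverified.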
xss-hunter
9/10 XSS found in 2 minutes vs 7/10 in 8 minutes.
| Metric | Without Skill | With Skill | Improvement |
|---|---|---|---|
| Turns to complete | 1 | 1 | ⚪ 0% |
| Total tokens | ~4,378 | ~1,622 | 🟢 -63% |
| Time | ~8 min | ~2 min | 🟢 -75% |
| XSS findings (out of 10) | 7 | 9 | 🟢 +29% |
Custom lab with 10 planted XSS vulnerabilities. Without the skill, the agent missed 3, including DOM-based and stored variants, due to an incomplete coverage strategy and redundant recon steps. With xss-hunter, the pre-ordered test sequence (reflected → stored → DOM-based) eliminated redundancy and improved coverage to 9 of 10.
jwt-cracker
3 turns → 1. -75% time.
| Metric | Without Skill | With Skill | Improvement |
|---|---|---|---|
| Turns to complete | 3 | 1 | 🟢 -67% |
| Total tokens | ~14,458 | ~4,866 | 🟢 -66% |
| Time | 355s | 87s | 🟢 -75% |
Without the skill, the agent needed 2 correction prompts before producing a usable JWT test. With jwt-cracker, it produced complete structured output on the first turn, with phases, expected outputs, and interpretation annotated.
control-lookup
3 turns → 1. High user effort → Low.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 3 | 1 | 🟢 -2 turns |
| Response tokens | ~3,297 | ~2,068 | 🟢 -37% |
| Total time | 63s | 51s | 🟢 -19% |
| User effort | High | Low | 🟢 |
Without the skill, the agent’s first response claimed it had already answered (it hadn’t), so 2 more correction prompts were needed. With control-lookup, it immediately produced the correct control card with cross-framework mappings to NIST CSF and PCI-DSS in a single turn.
cvss-scorer
-63% tokens, -68% time. Same score, no noise.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 1 | 1 | ⚪ 0 |
| Response tokens | ~539 | ~201 | 🟢 -63% |
| Total time | 18s | 6s | 🟢 -68% |
Both produce the correct CVSS vector. The skill version is 3 lines: vector, score, one contextual note. Without it, the agent writes roughly 500 tokens of explanation around the same answer. Fast scoring for a busy pentest workflow.
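For readers unfamiliar with how a vector maps to a score, the sketch below computes a base score from a vector string. It assumes CVSS v3.1 (the benchmark does not state which version was used) and covers base metrics only; the weights and rounding rule follow the published v3.1 specification, while the function names are hypothetical.

```python
# Metric weights from the CVSS v3.1 specification (base metrics only).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required depends on whether Scope is Unchanged (U) or Changed (C).
PR_WEIGHTS = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(value: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (scaled // 10000 + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the base score for a vector such as 'CVSS:3.1/AV:N/AC:L/...'."""
    fields = dict(part.split(":") for part in vector.split("/"))
    scope = fields["S"]
    w = {name: WEIGHTS[name][fields[name]] for name in WEIGHTS}
    pr = PR_WEIGHTS[scope][fields["PR"]]

    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    if scope == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = 8.22 * w["AV"] * w["AC"] * pr * w["UI"]

    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if scope == "C":
        total *= 1.08
    return roundup(min(total, 10))


vector = "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"
print(vector)
print(base_score(vector))  # 9.8
```

The two printed lines correspond to the first two lines of the concise output format described above (vector, then score); the contextual note is left to the analyst.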
scope-grill
-93% tokens. -90% time. Structured scope collection vs a wall of legal text.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~2,332 | ~165 | 🟢 -93% |
| Total time | 43s | 4s | 🟢 -90% |
Without the skill, the agent dumped a full legal disclaimer and engagement template, which was overwhelming and not actionable. With scope-grill, it asked the first of 10 structured scoping questions and collected information one step at a time. Complete in 1 turn.
engagement-handoff
-58% tokens, -56% time. End-of-day status → structured handoff doc.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~2,499 | ~1,060 | 🟢 -58% |
| Total time | 52s | 23s | 🟢 -56% |
compliance-gap-analyzer
Partial → Complete in 1 turn. -38% tokens.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 1 | 1 | ⚪ 0 |
| Response tokens | ~4,317 | ~2,661 | 🟢 -38% |
| Total time | 88s | 57s | 🟢 -35% |
| Output quality | Partial | Complete | 🟢 +2 levels |
remediation-planner
2 turns → 1. -57% tokens.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~2,419 | ~1,052 | 🟢 -57% |
| Total time | 44s | 22s | 🟢 -50% |
risk-assessor
2 turns → 1 for CVE emergency patch decisions.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~4,629 | ~2,009 | 🟢 -57% |
| Total time | 101s | 68s | 🟢 -32% |
vuln-diagnose
-42% tokens, -53% time.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~2,255 | ~1,315 | 🟢 -42% |
| Total time | 55s | 26s | 🟢 -53% |
attack-surface
-37% tokens. Partial → Good attack surface map.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 1 | 1 | ⚪ 0 |
| Response tokens | ~4,039 | ~2,562 | 🟢 -37% |
| Total time | 88s | 56s | 🟢 -36% |
| Output quality | Partial | Good | 🟢 +1 level |
nuclei-template-writer
Good → Complete. Adds matcher strategy explanation the raw version skips.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 1 | 1 | ⚪ 0 |
| Response tokens | ~1,632 | ~1,585 | 🟢 -3% |
| Output quality | Good | Complete | 🟢 +1 level |
ssti-hunter
Same result, faster. No wasted effort guessing the template engine.
| Metric | Without Skill | With Skill | Improvement |
|---|---|---|---|
| Turns to complete | 1 | 1 | ⚪ 0% |
| Total tokens | ~3,475 | ~2,241 | 🟡 -36% |
| Time | 105s | 65s | 🟡 -38% |
Both runs found and exploited the SSTI in a single turn. The difference is speed: without the skill, the agent spent extra steps guessing the template engine before picking payloads. With ssti-hunter, a deterministic polyglot detection sequence reached confirmed exploitation faster, with no engine guessing and no wasted steps.
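What "deterministic" detection buys can be illustrated with a small decision table. This sketch is not the skill's polyglot sequence: it swaps in the widely documented arithmetic probes (`{{7*7}}`, `${7*7}`, and so on) and a simplified mapping from rendered output to engine family. The `fingerprint` function and the engine labels are assumptions for illustration.

```python
# Arithmetic probes commonly used to tell template syntaxes apart.
PROBES = ["{{7*7}}", "{{7*'7'}}", "${7*7}", "<%= 7*7 %>"]


def fingerprint(rendered: dict) -> str:
    """Map probe results to an engine family.

    `rendered` maps each probe string to the text the application returned
    in its place. The decision order is fixed, so the same observations
    always yield the same answer.
    """
    if rendered.get("{{7*7}}") == "49":
        quoted = rendered.get("{{7*'7'}}")
        if quoted == "7777777":
            return "Jinja2-like"  # string repetition
        if quoted == "49":
            return "Twig-like"    # numeric coercion
        return "unknown {{ }} engine"
    if rendered.get("${7*7}") == "49":
        return "${ } engine (for example Freemarker or Mako)"
    if rendered.get("<%= 7*7 %>") == "49":
        return "ERB-like"
    return "no template evaluation observed"


# Example observation set: the double-brace probes evaluated, the others did not.
observed = {
    "{{7*7}}": "49",
    "{{7*'7'}}": "7777777",
    "${7*7}": "${7*7}",
    "<%= 7*7 %>": "<%= 7*7 %>",
}
print(fingerprint(observed))  # Jinja2-like
```

Once the engine family is known, payload selection is a lookup rather than trial and error, which is where the time saving in the table above comes from.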
pentest-report
Same quality, less overhead.
| Metric | Without Skill | With Skill | Difference |
|---|---|---|---|
| Turns to complete | 1 | 1 | ⚪ 0 |
| Response tokens | ~7,144 | ~5,444 | 🟢 -24% |
| Total time | 128s | 100s | 🟢 -22% |
Reading the color codes
| Color | Meaning |
|---|---|
| 🟢 | Improvement |
| 🟡 | Moderate improvement |
| 🔴 | Trade-off: more tokens/time for better quality |
| ⚪ | No change |
Red is not always bad. Skills like bugbounty-reporter, js-analyzer, and check-exploit use more tokens because they produce more complete output. The benchmark shows the trade-off explicitly so you can decide whether it’s worth it for your workflow.