All benchmarks run on claude-sonnet-4-6 via the Claude Code CLI. Same prompt, same model, same target. The only variable is whether the skill is loaded.

How benchmarks work

Each benchmark runs the identical prompt twice: once with no skill loaded (the agent improvises) and once with the skill loaded as system context. The comparison measures:
  • Turns to completion: how many back-and-forth exchanges before a usable result
  • Tokens used: total input + output tokens consumed
  • Time: wall-clock seconds from prompt to final output
  • Output quality: rated Incomplete → Partial → Good → Complete
The goal is not to make the agent look bad without skills; it’s to show exactly what structured methodology adds, and where.
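As a rough illustration of how the deltas in the tables below can be derived from the four measurements, here is a minimal sketch. The `Run` class and the helper names are hypothetical (they are not part of the benchmark tooling); the sample numbers are the find-skills figures reported on this page.

```python
from dataclasses import dataclass

# Quality scale as described above, lowest to highest.
QUALITY_LEVELS = ["Incomplete", "Partial", "Good", "Complete"]


@dataclass
class Run:
    """One benchmark run (hypothetical container, for illustration only)."""
    turns: int
    tokens: int
    seconds: float
    quality: str


def percent_change(baseline: float, with_skill: float) -> int:
    """Relative change versus the no-skill baseline, rounded to a whole percent."""
    return round((with_skill - baseline) / baseline * 100)


def quality_delta(baseline: str, with_skill: str) -> int:
    """Number of quality levels gained (negative if quality dropped)."""
    return QUALITY_LEVELS.index(with_skill) - QUALITY_LEVELS.index(baseline)


def compare(baseline: Run, skill: Run) -> dict:
    """Summarize the skill run against the baseline run."""
    return {
        "turns": skill.turns - baseline.turns,
        "tokens_pct": percent_change(baseline.tokens, skill.tokens),
        "time_pct": percent_change(baseline.seconds, skill.seconds),
        "quality_levels": quality_delta(baseline.quality, skill.quality),
    }


# find-skills figures from this page
baseline = Run(turns=2, tokens=4831, seconds=185, quality="Incomplete")
with_skill = Run(turns=1, tokens=233, seconds=5, quality="Complete")
print(compare(baseline, with_skill))
# {'turns': -1, 'tokens_pct': -95, 'time_pct': -97, 'quality_levels': 3}
```

Negative percentages mean the skill run used less of that resource than the baseline.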

Results at a glance

| Skill | Turns (no skill) | Turns (skill) | Key gain |
| --- | --- | --- | --- |
| idor-hunter | 1 | 1 | +120% findings (5 → 11 IDORs) |
| find-skills | 2 | 1 | -97% time (185s → 5s) |
| scope-grill | 2 | 1 | -93% tokens, -90% time |
| hexstrike-forge | 0 findings | 2 confirmed | 0 → 2 report-ready findings |
| ssrf-hunter | 1 (false positives) | 1 (confirmed exploit) | False positives → confirmed exploit |
| xss-hunter | 7/10 found, 8 min | 9/10 found, 2 min | +29% coverage, -75% time |
| jwt-cracker | 3 | 1 | -67% turns, -75% time |
| control-lookup | 3 | 1 | -2 turns, -37% tokens |
| cvss-scorer | 1 (verbose) | 1 (concise) | -63% tokens, -68% time |
| engagement-handoff | 2 | 1 | -58% tokens, -56% time |
| compliance-gap-analyzer | 1 (Partial) | 1 (Complete) | +2 quality levels |
| remediation-planner | 2 | 1 | -57% tokens, -50% time |
| risk-assessor | 2 | 1 | -57% tokens, -32% time |
| vuln-diagnose | 2 | 1 | -42% tokens, -53% time |
| attack-surface | 1 (Partial) | 1 (Good) | -37% tokens, -36% time |
| nuclei-template-writer | 1 (Good) | 1 (Complete) | +1 quality level |
| ssti-hunter | 1 (slow) | 1 (fast) | -38% time, no wasted steps |
| pentest-report | 1 | 1 | -24% tokens, -22% time |


Detailed results

idor-hunter

+120% findings on the same target with the same prompt.
| Metric | Without Skill | With Skill | Improvement |
| --- | --- | --- | --- |
| Turns to complete | 1 | 1 | ⚪ 0% |
| Total tokens | ~3,521 | ~1,696 | 🟢 -52% |
| Time | 77s | 40s | 🟡 -48% |
| IDOR findings | 5 | 11 | 🟢 +120% |
Without the skill, the agent applied a shallow approach and stopped after the most obvious vectors, with 5 IDORs found. With idor-hunter, it followed a complete enumeration across path params, query strings, JSON bodies, and headers, finding 11 IDORs on the same target. That is 6 vulnerabilities that would have been missed in a real engagement.
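To make "complete enumeration" concrete, the sketch below lists every place an object identifier appears across the four input surfaces named above and builds one mutated request per location. It is only an illustration under stated assumptions: the request is a plain dict, the JSON body is flat, and `idor_candidates`, `own_id`, and `other_id` are hypothetical names, not part of the idor-hunter skill. It sends nothing over the network.

```python
from copy import deepcopy


def idor_candidates(request: dict, own_id: str, other_id: str) -> list[tuple[str, dict]]:
    """Return (surface, mutated_request) pairs, one for each place own_id appears.

    Assumes `request` has a "path" string and optional flat dicts under
    "query", "json", and "headers" (an illustrative shape, not a real API).
    """
    candidates = []

    # Surface 1: path parameters (whole path segments equal to the identifier)
    segments = request["path"].split("/")
    if own_id in segments:
        mutated = deepcopy(request)
        mutated["path"] = "/".join(other_id if s == own_id else s for s in segments)
        candidates.append(("path", mutated))

    # Surfaces 2-4: query strings, JSON bodies, headers
    for surface in ("query", "json", "headers"):
        for key, value in request.get(surface, {}).items():
            if str(value) == own_id:
                mutated = deepcopy(request)
                mutated[surface][key] = other_id
                candidates.append((f"{surface}:{key}", mutated))

    return candidates


request = {
    "method": "GET",
    "path": "/api/users/1001/orders",
    "query": {"owner": "1001"},
    "json": {"account_id": "1001", "note": "hi"},
    "headers": {"X-User-Id": "1001", "Accept": "application/json"},
}
candidates = idor_candidates(request, own_id="1001", other_id="1002")
for surface, _ in candidates:
    print(surface)
```

Each mutated request would then be replayed with the original user's session on an authorized target; a response that returns the other user's data indicates an IDOR at that surface.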

find-skills

-97% time. 185 seconds → 5 seconds.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~4,831 | ~233 | 🟢 -95% |
| Total time | 185s | 5s | 🟢 -97% |
| Output quality | Incomplete | Complete | 🟢 +3 levels |
Without the skill, the agent didn’t know what skills exist, so it improvised a Python script suggestion and confused the user. With find-skills, it immediately identified finding-writer as the right skill and provided the install command. One turn, 5 seconds.

hexstrike-forge

0 confirmed findings → 2 report-ready findings with CVSS + PoC + remediation.
| Metric | Without Skill | With Skill | Delta |
| --- | --- | --- | --- |
| Phases executed | 1 (ad hoc) | 5 (structured) | 🟢 +400% |
| Tool calls made | 7 | 18 | 🟢 +157% |
| Tool failures recovered | 0 / 2 | 3 / 3 | 🟢 100% vs 0% |
| Confirmed findings | 0 | 2 | 🟢 0 → 2 |
| False positives discarded | unmeasured | 6 of 8 flags | 🟢 clean triage |
| Engagement completeness | Partial | Full | 🟢 |
Prompt: “pentest scanme.nmap.org”. Four words, same target, same MCP server. Without the skill, the agent ran 7 tools ad hoc, hit the same tool bug twice without recovering, misunderstood the workflow tool, and produced zero deliverables, just raw JSON output. With hexstrike-forge, it ran 5 structured phases, recovered from 3 tool failures, discarded 6 false positives, and produced 2 report-ready findings.
HexStrike without the skill is a toolbox. HexStrike with the skill is an engagement.

ssrf-hunter

False positives vs a confirmed exploit: the most critical delta.
| Metric | Without Skill | With Skill | Improvement |
| --- | --- | --- | --- |
| Turns to complete | 1 | 1 | ⚪ 0% |
| Total tokens | ~5,001 | ~1,953 | 🟢 -61% |
| Time | 107s | 45s | 🟢 -58% |
| Output quality | False positives | Confirmed exploit | 🟢 Critical |
Without the skill, the agent reported a likely SSRF without verification, a result that fails triage and wastes the program’s time. With ssrf-hunter, it followed a structured confirmation sequence (OOB callback → loopback → cloud metadata) and produced a verified working payload. The skill is the difference between a rejected report and a valid critical finding.
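The confirmation sequence can be pictured as an ordered pipeline in which a finding is only reported as confirmed once a stage yields concrete evidence. The sketch below is an assumption about how such a sequence might be organized, not the skill's actual implementation: `confirm_ssrf`, the stage names, and the `probe` callback are hypothetical, and the probe here is a stub that performs no network activity.

```python
from typing import Callable, Optional

# Stage order taken from the description above.
STAGES = ["oob_callback", "loopback", "cloud_metadata"]


def confirm_ssrf(probe: Callable[[str], Optional[str]]) -> dict:
    """Run each stage in order and collect evidence.

    `probe(stage)` is a caller-supplied function (hypothetical) that returns
    a short evidence string if the stage succeeded, or None if it did not.
    """
    evidence = {}
    for stage in STAGES:
        result = probe(stage)
        if result is not None:
            evidence[stage] = result

    # Without evidence the finding stays "unverified" instead of being reported.
    status = "confirmed" if evidence else "unverified"
    return {"status": status, "evidence": evidence}


def stub_probe(stage: str) -> Optional[str]:
    """Stand-in for real checks: pretend only the OOB callback fired."""
    observed = {"oob_callback": "DNS lookup received from target"}
    return observed.get(stage)


print(confirm_ssrf(stub_probe))
print(confirm_ssrf(lambda stage: None))
```

Whether to stop at the first failed stage or to try all three is a design choice; this sketch tries all of them so that the evidence list is as complete as possible.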

xss-hunter

9/10 XSS found in 2 minutes vs 7/10 in 8 minutes.
| Metric | Without Skill | With Skill | Improvement |
| --- | --- | --- | --- |
| Turns to complete | 1 | 1 | ⚪ 0% |
| Total tokens | ~4,378 | ~1,622 | 🟢 -63% |
| Time | ~8 min | ~2 min | 🟢 -75% |
| XSS findings (out of 10) | 7 | 9 | 🟢 +29% |
Custom lab with 10 planted XSS vulnerabilities. Without the skill, the agent missed 3, including DOM-based and stored variants, due to an incomplete coverage strategy and redundant recon steps. With xss-hunter, the pre-ordered test sequence (reflected → stored → DOM-based) eliminated redundancy and improved coverage.
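Note that the +29% figure is a relative gain in findings (7 → 9). Measured against the 10 planted vulnerabilities, coverage rose from 70% to 90%, which is 20 percentage points. A small worked example, with hypothetical helper names, shows both readings:

```python
def relative_gain_pct(before: int, after: int) -> int:
    """Gain relative to the baseline count, as a whole percent."""
    return round((after - before) / before * 100)


def point_gain(before: int, after: int, planted: int) -> int:
    """Gain in coverage of the planted set, in percentage points."""
    return round((after - before) / planted * 100)


planted = 10
found_without, found_with = 7, 9

print(relative_gain_pct(found_without, found_with))    # 29
print(point_gain(found_without, found_with, planted))  # 20
```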

jwt-cracker

3 turns → 1. -75% time.
| Metric | Without Skill | With Skill | Improvement |
| --- | --- | --- | --- |
| Turns to complete | 3 | 1 | 🟢 -67% |
| Total tokens | ~14,458 | ~4,866 | 🟢 -66% |
| Time | 355s | 87s | 🟢 -75% |
Without the skill, the agent needed 2 correction prompts before producing a usable JWT test. With jwt-cracker, it produced complete structured output on the first turn, with phases, expected outputs, and interpretation annotated.

control-lookup

3 turns → 1. High user effort → Low.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 3 | 1 | 🟢 -2 turns |
| Response tokens | ~3,297 | ~2,068 | 🟢 -37% |
| Total time | 63s | 51s | 🟢 -19% |
| User effort | High | Low | 🟢 |
Without the skill, the agent’s first response claimed it had already answered (it hadn’t), and 2 more correction prompts were needed. With control-lookup, it immediately produced the correct control card, with cross-framework mappings to NIST CSF and PCI-DSS, in a single turn.

cvss-scorer

-63% tokens, -68% time. Same score, no noise.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 1 | 1 | ⚪ 0 |
| Response tokens | ~539 | ~201 | 🟢 -63% |
| Total time | 18s | 6s | 🟢 -68% |
Both produce the correct CVSS vector. The skill version is 3 lines: vector, score, and one contextual note. Without it, the agent writes roughly 500 tokens of explanation around the same answer. Fast scoring for a busy pentest workflow.

scope-grill

-93% tokens. -90% time. Structured scope collection vs a wall of legal text.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~2,332 | ~165 | 🟢 -93% |
| Total time | 43s | 4s | 🟢 -90% |
Without the skill, the agent dumped a full legal disclaimer and engagement template: overwhelming and not actionable. With scope-grill, it asked the first of 10 structured scoping questions and collected information one step at a time. Complete in 1 turn.

engagement-handoff

-58% tokens, -56% time. End-of-day status → structured handoff doc.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~2,499 | ~1,060 | 🟢 -58% |
| Total time | 52s | 23s | 🟢 -56% |

compliance-gap-analyzer

Partial → Complete in 1 turn. -38% tokens.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 1 | 1 | ⚪ 0 |
| Response tokens | ~4,317 | ~2,661 | 🟢 -38% |
| Total time | 88s | 57s | 🟢 -35% |
| Output quality | Partial | Complete | 🟢 +2 levels |

remediation-planner

2 turns → 1. -57% tokens.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~2,419 | ~1,052 | 🟢 -57% |
| Total time | 44s | 22s | 🟢 -50% |

risk-assessor

2 turns → 1 for CVE emergency patch decisions.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~4,629 | ~2,009 | 🟢 -57% |
| Total time | 101s | 68s | 🟢 -32% |

vuln-diagnose

-42% tokens, -53% time.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 2 | 1 | 🟢 -1 turn |
| Response tokens | ~2,255 | ~1,315 | 🟢 -42% |
| Total time | 55s | 26s | 🟢 -53% |

attack-surface

-37% tokens. Partial → Good attack surface map.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 1 | 1 | ⚪ 0 |
| Response tokens | ~4,039 | ~2,562 | 🟢 -37% |
| Total time | 88s | 56s | 🟢 -36% |
| Output quality | Partial | Good | 🟢 +1 level |

nuclei-template-writer

Good → Complete. Adds matcher strategy explanation the raw version skips.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 1 | 1 | ⚪ 0 |
| Response tokens | ~1,632 | ~1,585 | 🟢 -3% |
| Output quality | Good | Complete | 🟢 +1 level |

ssti-hunter

Same result, faster. No wasted steps guessing the template engine.
| Metric | Without Skill | With Skill | Improvement |
| --- | --- | --- | --- |
| Turns to complete | 1 | 1 | ⚪ 0% |
| Total tokens | ~3,475 | ~2,241 | 🟡 -36% |
| Time | 105s | 65s | 🟡 -38% |
Both runs found and exploited the SSTI in a single turn. The difference is speed: without the skill, the agent spent extra steps guessing the template engine before picking payloads. With ssti-hunter, a deterministic polyglot detection sequence reached confirmed exploitation faster: no engine guessing, no wasted steps.

pentest-report

Same quality, less overhead.
| Metric | Without Skill | With Skill | Difference |
| --- | --- | --- | --- |
| Turns to complete | 1 | 1 | ⚪ 0 |
| Response tokens | ~7,144 | ~5,444 | 🟢 -24% |
| Total time | 128s | 100s | 🟢 -22% |

Reading the color codes

| Color | Meaning |
| --- | --- |
| 🟢 | Improvement |
| 🟡 | Moderate improvement |
| 🔴 | Trade-off: more tokens/time for better quality |
| ⚪ | No change |
Red is not always bad. Skills like bugbounty-reporter, js-analyzer, and check-exploit use more tokens because they produce more complete output. The benchmark shows the trade-off explicitly so you can decide whether it’s worth it for your workflow.