Benchmarks

All benchmarks run on claude-sonnet-4-6 via the Claude Code CLI. Same prompt, same model, same target. The only variable is whether the skill is loaded.

How benchmarks work

Each benchmark runs the identical prompt twice: once with no skill loaded (the agent improvises), and once with the skill loaded as system context. The comparison measures:

Turns to completion: how many back-and-forth exchanges before a usable result
Tokens used: total input + output tokens consumed
Time: wall-clock seconds from prompt to final output
Output quality: rated Incomplete → Partial → Good → Complete
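
The four measurements above can be captured in a small record and compared as signed percentages. The sketch below is only an illustration: `Run`, `pct_change`, and `compare` are hypothetical names, not part of the actual benchmark harness, and the sample figures are the find-skills numbers reported in the detailed results on this page.

```python
from dataclasses import dataclass

# Quality scale from the list above, ordered worst to best.
QUALITY_LEVELS = ["Incomplete", "Partial", "Good", "Complete"]


@dataclass
class Run:
    """One benchmark run (hypothetical record, not the real harness)."""
    turns: int
    tokens: int
    seconds: float
    quality: str


def pct_change(before: float, after: float) -> int:
    """Signed percentage change from before to after, as a whole percent."""
    return round((after - before) / before * 100)


def compare(no_skill: Run, with_skill: Run) -> dict:
    """Summarise the with-skill run relative to the no-skill run."""
    return {
        "turns": with_skill.turns - no_skill.turns,
        "tokens_pct": pct_change(no_skill.tokens, with_skill.tokens),
        "time_pct": pct_change(no_skill.seconds, with_skill.seconds),
        "quality_levels": QUALITY_LEVELS.index(with_skill.quality)
        - QUALITY_LEVELS.index(no_skill.quality),
    }


# find-skills figures taken from the detailed results on this page.
baseline = Run(turns=2, tokens=4831, seconds=185, quality="Incomplete")
skilled = Run(turns=1, tokens=233, seconds=5, quality="Complete")
summary = compare(baseline, skilled)
print(summary)
```

Negative percentages mean the skill run used less; a positive `quality_levels` value means the output moved up the rating scale. With the idor-hunter counts, `pct_change(5, 11)` gives 120, matching the +120% findings figure.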

The goal is not to make the agent look bad without skills; it’s to show exactly what structured methodology adds, and where.

Results at a glance

Skill	Without skill (turns unless noted)	With skill (turns unless noted)	Key gain
idor-hunter	1	1	+120% findings (5 → 11 IDORs)
find-skills	2	1	-97% time (185s → 5s)
scope-grill	2	1	-93% tokens, -90% time
hexstrike-forge	0 findings	2 confirmed	0 → 2 report-ready findings
ssrf-hunter	1 (false positives)	1 (confirmed exploit)	False positives → confirmed exploit
xss-hunter	7/10 found, 8 min	9/10 found, 2 min	+29% coverage, -75% time
jwt-cracker	3	1	-67% turns, -75% time
control-lookup	3	1	-2 turns, -37% tokens
cvss-scorer	1 (verbose)	1 (concise)	-63% tokens, -68% time
engagement-handoff	2	1	-58% tokens, -56% time
compliance-gap-analyzer	1 (Partial)	1 (Complete)	+2 quality levels
remediation-planner	2	1	-57% tokens, -50% time
risk-assessor	2	1	-57% tokens, -32% time
vuln-diagnose	2	1	-42% tokens, -53% time
attack-surface	1 (Partial)	1 (Good)	-37% tokens, -36% time
nuclei-template-writer	1 (Good)	1 (Complete)	+1 quality level
ssti-hunter	1 (slow)	1 (fast)	-38% time, no wasted turns
pentest-report	1	1	-24% tokens, -22% time

Detailed results

idor-hunter

+120% findings on the same target with the same prompt.

Metric	Without Skill	With Skill	Improvement
Turns to complete	1	1	⚪ 0%
Total tokens	~3,521	~1,696	🟢 -52%
Time	77s	40s	🟡 -48%
IDOR findings	5	11	🟢 +120%

Without the skill, the agent applied a shallow approach and stopped after the most obvious vectors, with 5 IDORs found. With idor-hunter, it followed a complete enumeration across path params, query strings, JSON bodies, and headers, finding 11 IDORs on the same target. That is 6 vulnerabilities that would have been missed in a real engagement.
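
To make the four locations concrete, here is a minimal sketch that lists identifier-looking values in a single request, grouped by where they appear. It is not the skill's actual method: the function names, the "integer or UUID" heuristic, and the example URL and header are all assumptions chosen for illustration.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Assumption: an "object identifier" is an integer or a UUID.
ID_PATTERN = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def looks_like_id(value) -> bool:
    """True when the value matches the identifier heuristic above."""
    return ID_PATTERN.match(str(value)) is not None


def walk_json(node, prefix=""):
    """Yield (dotted_key, value) for every scalar in a JSON-like structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk_json(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk_json(child, f"{prefix}[{index}]")
    else:
        yield prefix, node


def candidate_ids(url, json_body=None, headers=None):
    """Return (location, name, value) for each identifier-looking value."""
    parts = urlsplit(url)
    found = []
    # 1. Path parameters, named by their segment position.
    for position, segment in enumerate(parts.path.strip("/").split("/")):
        if looks_like_id(segment):
            found.append(("path", str(position), segment))
    # 2. Query string parameters.
    for name, value in parse_qsl(parts.query):
        if looks_like_id(value):
            found.append(("query", name, value))
    # 3. JSON body fields, including nested ones.
    for name, value in walk_json(json_body or {}):
        if looks_like_id(value):
            found.append(("json", name, str(value)))
    # 4. Request headers.
    for name, value in (headers or {}).items():
        if looks_like_id(value):
            found.append(("header", name, value))
    return found


hits = candidate_ids(
    "https://api.example.test/users/1001/orders?invoice_id=77",
    json_body={"owner": {"id": 1001}, "note": "hello"},
    headers={"X-Account-Id": "1001", "Accept": "application/json"},
)
for hit in hits:
    print(hit)
```

Each tuple is only a candidate to test for broken access control on an authorized target, not a finding. Stopping after the path and query string would miss the JSON and header candidates in this example.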

find-skills

-97% time. 185 seconds → 5 seconds.

Metric	Without Skill	With Skill	Difference
Turns to complete	2	1	🟢 -1 turn
Response tokens	~4,831	~233	🟢 -95%
Total time	185s	5s	🟢 -97%
Output quality	Incomplete	Complete	🟢 +3 levels

Without the skill, the agent didn’t know what skills exist, so it improvised a Python script suggestion and confused the user. With find-skills, it immediately identified finding-writer as the right skill and provided the install command. One turn, 5 seconds.

hexstrike-forge

0 confirmed findings → 2 report-ready findings with CVSS + PoC + remediation.

Metric	Without Skill	With Skill	Delta
Phases executed	1 (ad hoc)	5 (structured)	🟢 +400%
Tool calls made	7	18	🟢 +157%
Tool failures recovered	0 / 2	3 / 3	🟢 100% vs 0%
Confirmed findings	0	2	🟢 0 → 2
False positives discarded	unmeasured	6 of 8 flags	🟢 clean triage
Engagement completeness	Partial	Full	🟢

Prompt: “pentest scanme.nmap.org”. Four words, same target, same MCP server. Without the skill, the agent ran 7 tools ad hoc, hit the same tool bug twice without recovering, misunderstood the workflow tool, and produced zero deliverables, just raw JSON output. With hexstrike-forge, it ran 5 structured phases, recovered from 3 tool failures, discarded 6 false positives, and produced 2 report-ready findings.

HexStrike without the skill is a toolbox. HexStrike with the skill is an engagement.

ssrf-hunter

False positives vs. a confirmed exploit: the most critical delta.

Metric	Without Skill	With Skill	Improvement
Turns to complete	1	1	⚪ 0%
Total tokens	~5,001	~1,953	🟢 -61%
Time	107s	45s	🟢 -58%
Output quality	False positives	Confirmed exploit	🟢 Critical

Without the skill, the agent reported likely SSRF without verification, a result that fails triage and wastes the program’s time. With ssrf-hunter, it followed a structured confirmation sequence (OOB callback → loopback → cloud metadata) and produced a verified working payload. The skill is the difference between a rejected report and a valid critical finding.

xss-hunter

9/10 XSS found in 2 minutes vs 7/10 in 8 minutes.

Metric	Without Skill	With Skill	Improvement
Turns to complete	1	1	⚪ 0%
Total tokens	~4,378	~1,622	🟢 -63%
Time	~8 min	~2 min	🟢 -75%
XSS findings (out of 10)	7	9	🟢 +29%

Custom lab with 10 planted XSS vulnerabilities. Without the skill, the agent missed 3, including DOM-based and stored variants, because of an incomplete coverage strategy, and it lost time to redundant recon steps. With xss-hunter, the pre-ordered test sequence (reflected → stored → DOM-based) eliminated redundancy and improved coverage.
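
The +29% figure is a relative gain in findings, not a gain in percentage points of lab coverage. A short sketch of the arithmetic, using only the counts from the table above (the function names are illustrative):

```python
PLANTED = 10  # XSS vulnerabilities planted in the lab


def coverage(found: int, total: int = PLANTED) -> float:
    """Fraction of the planted vulnerabilities that were found."""
    return found / total


def relative_gain(before: int, after: int) -> float:
    """Increase in findings relative to the no-skill count."""
    return (after - before) / before


without_skill, with_skill = 7, 9

coverage_before = coverage(without_skill)
coverage_after = coverage(with_skill)
point_gain = round((coverage_after - coverage_before) * 100)
gain_pct = round(relative_gain(without_skill, with_skill) * 100)

print(f"coverage: {coverage_before:.0%} to {coverage_after:.0%}")
print(f"gain in percentage points: {point_gain}")
print(f"relative gain in findings: {gain_pct}%")
```

Lab coverage moves from 70% to 90%, which is 20 percentage points. Measured against the 7 findings of the no-skill run, the 2 extra findings are a 29% increase.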

jwt-cracker

3 turns → 1. -75% time.

Metric	Without Skill	With Skill	Improvement
Turns to complete	3	1	🟢 -67%
Total tokens	~14,458	~4,866	🟢 -66%
Time	355s	87s	🟢 -75%

Without the skill, the agent needed 2 correction prompts before producing a usable JWT test. With jwt-cracker, it produced complete structured output on the first turn: phases, expected outputs, and annotated interpretation.

control-lookup

3 turns → 1. High user effort → Low.

Metric	Without Skill	With Skill	Difference
Turns to complete	3	1	🟢 -2 turns
Response tokens	~3,297	~2,068	🟢 -37%
Total time	63s	51s	🟢 -19%
User effort	High	Low	🟢

Without the skill, the agent’s first response claimed it had already answered (it hadn’t), so 2 more correction prompts were needed. With control-lookup, it immediately produced the correct control card with cross-framework mappings to NIST CSF and PCI-DSS in a single turn.

cvss-scorer

-63% tokens, -68% time. Same score, no noise.

Metric	Without Skill	With Skill	Difference
Turns to complete	1	1	⚪ 0
Response tokens	~539	~201	🟢 -63%
Total time	18s	6s	🟢 -68%

Both produce the correct CVSS vector. The skill version is 3 lines: vector, score, and one contextual note. Without it, the agent writes about 500 tokens of explanation around the same answer. Fast scoring for a busy pentest workflow.
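
For readers who want to check a score against its vector, the sketch below computes a base score from a vector string. This page does not say which CVSS version the benchmark used, so the sketch assumes v3.1 and takes its weights and rounding rule from the public FIRST specification. It illustrates the arithmetic and is not the skill's implementation.

```python
import math

# Metric weights from the CVSS v3.1 specification (version is an assumption).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2},
    "AC": {"L": 0.77, "H": 0.44},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}
# Privileges Required is weighted differently when scope changes.
PR_WEIGHTS = {
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.5},
}


def roundup(value: float) -> float:
    """Round up to one decimal using the spec's integer-based method."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    """Compute the v3.1 base score for a vector such as 'AV:N/AC:L/...'."""
    fields = dict(part.split(":") for part in vector.split("/"))
    fields.pop("CVSS", None)  # drop an optional "CVSS:3.1" prefix
    scope = fields["S"]

    c = WEIGHTS["C"][fields["C"]]
    i = WEIGHTS["I"][fields["I"]]
    a = WEIGHTS["A"][fields["A"]]
    iss = 1 - (1 - c) * (1 - i) * (1 - a)

    if scope == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15

    exploitability = (
        8.22
        * WEIGHTS["AV"][fields["AV"]]
        * WEIGHTS["AC"][fields["AC"]]
        * PR_WEIGHTS[scope][fields["PR"]]
        * WEIGHTS["UI"][fields["UI"]]
    )

    if impact <= 0:
        return 0.0
    if scope == "U":
        return roundup(min(impact + exploitability, 10))
    return roundup(min(1.08 * (impact + exploitability), 10))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))
```

The example vector is the well-known unauthenticated network case, which scores 9.8. Temporal and environmental metrics are out of scope for this sketch.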

scope-grill

-93% tokens. -90% time. Structured scope collection vs a wall of legal text.

Metric	Without Skill	With Skill	Difference
Turns to complete	2	1	🟢 -1 turn
Response tokens	~2,332	~165	🟢 -93%
Total time	43s	4s	🟢 -90%

Without the skill, the agent dumped a full legal disclaimer and engagement template: overwhelming and not actionable. With scope-grill, it asked the first of 10 structured scoping questions and collected information one step at a time. Complete in 1 turn.

engagement-handoff

-58% tokens, -56% time. End-of-day status → structured handoff doc.

Metric	Without Skill	With Skill	Difference
Turns to complete	2	1	🟢 -1 turn
Response tokens	~2,499	~1,060	🟢 -58%
Total time	52s	23s	🟢 -56%

compliance-gap-analyzer

Partial → Complete in 1 turn. -38% tokens.

Metric	Without Skill	With Skill	Difference
Turns to complete	1	1	⚪ 0
Response tokens	~4,317	~2,661	🟢 -38%
Total time	88s	57s	🟢 -35%
Output quality	Partial	Complete	🟢 +2 levels

remediation-planner

2 turns → 1. -57% tokens.

Metric	Without Skill	With Skill	Difference
Turns to complete	2	1	🟢 -1 turn
Response tokens	~2,419	~1,052	🟢 -57%
Total time	44s	22s	🟢 -50%

risk-assessor

2 turns → 1 for CVE emergency patch decisions.

Metric	Without Skill	With Skill	Difference
Turns to complete	2	1	🟢 -1 turn
Response tokens	~4,629	~2,009	🟢 -57%
Total time	101s	68s	🟢 -32%

vuln-diagnose

-42% tokens, -53% time.

Metric	Without Skill	With Skill	Difference
Turns to complete	2	1	🟢 -1 turn
Response tokens	~2,255	~1,315	🟢 -42%
Total time	55s	26s	🟢 -53%

attack-surface

-37% tokens. Partial → Good attack surface map.

Metric	Without Skill	With Skill	Difference
Turns to complete	1	1	⚪ 0
Response tokens	~4,039	~2,562	🟢 -37%
Total time	88s	56s	🟢 -36%
Output quality	Partial	Good	🟢 +1 level

nuclei-template-writer

Good → Complete. Adds matcher strategy explanation the raw version skips.

Metric	Without Skill	With Skill	Difference
Turns to complete	1	1	⚪ 0
Response tokens	~1,632	~1,585	🟢 -3%
Output quality	Good	Complete	🟢 +1 level

ssti-hunter

Same result, faster. No wasted turns guessing the template engine.

Metric	Without Skill	With Skill	Improvement
Turns to complete	1	1	⚪ 0%
Total tokens	~3,475	~2,241	🟡 -36%
Time	105s	65s	🟡 -38%

Both runs found and exploited the SSTI. The difference is speed: without the skill, the agent spent extra steps guessing the template engine before picking payloads. With ssti-hunter, a deterministic polyglot detection sequence reached confirmed exploitation faster, with no engine guessing and no wasted effort.

pentest-report

Same quality, less overhead.

Metric	Without Skill	With Skill	Difference
Turns to complete	1	1	⚪ 0
Response tokens	~7,144	~5,444	🟢 -24%
Total time	128s	100s	🟢 -22%

Reading the color codes

Color	Meaning
🟢	Improvement
🟡	Moderate improvement
🔴	Trade-off: more tokens/time for better quality
⚪	No change

Red is not always bad. Skills like bugbounty-reporter, js-analyzer, and check-exploit use more tokens because they produce more complete output. The benchmark shows the trade-off explicitly so you can decide whether it’s worth it for your workflow.
