The pwnkit blog
Field notes on AI pentesting, agentic security, the XBOW benchmark, and the vulnerabilities autonomous AI agents find when you point them at real code. Built and written by the team behind pwnkit, the leading open-source AI pentest agent.
-
deleting better-sqlite3 from pwnkit, and what it cost us
pwnkit 0.7.1 ships with zero native modules. the persistence layer was migrated from better-sqlite3 to a pure-wasm sqlite implementation. here's what broke, what we kept, and why every npx install on every node version now just works.
-
how to read an ai pentest benchmark leaderboard
the public xbow benchmark has 39 challenges that no longer build on upstream because of image rot. every ai pentest score you've seen is on a patched substrate. here's what 'we scored 96% on xbow' actually means, and what to ask before you trust the number.
-
the unsolved nine: one win, one honeypot, and a regression test that killed our hypothesis
an A/B sweep over the 9 challenges keeping pwnkit off 100% on XBOW. one new flag, one honeypot, and a same-day regression test that killed our 'lean scaffolding wins' hypothesis. why a single XBOW solve is an anecdote, not a benchmark.
-
introducing pwnkit cloud
an autonomous ai attacker on contract, pointed at your product. closed beta. by application only. founder-led from zürich.
-
pwnkit oss now outperforms commercial pentest teams
the open-source agent just crossed the line: it now finds more real bugs on the public xbow benchmark than the commercial pentest engagement you were about to pay for.
-
the attack surface XBOW and KinoSec don't test
traditional web vuln benchmarks miss the entire AI/LLM security attack surface. prompt injection, jailbreaks, MCP tool abuse -- none of it shows up in XBOW's 104 challenges.
-
we built benchmarks for everything pwnkit does
five benchmark suites across web pentesting, LLM security, LLM safety, npm auditing, and network pentesting. here's what we learned.
-
pwnkit v0.4: shell-first pentesting, 27 XBOW flags, and the bug that broke everything
rebuilding pwnkit's agent architecture from structured tools to shell-first, cracking 27 XBOW benchmark challenges, and hunting down the serialization bug that was crashing the agent after 3 turns.
-
why we gave our agent a terminal instead of tools
we built 10 structured tools for web pentesting. then we gave the agent just curl and it outperformed everything.
-
what we learned running pwnkit against 104 CTF challenges
29 flags, a serialization bug, a 770-line prompt that didn't help, and why the model matters more than the framework.
-
running pwnkit against the XBOW benchmark
XBOW has 104 Docker CTF challenges covering traditional web vulns. here's how pwnkit performs against it.
-
100% on our AI security benchmark
10 challenges. 10 flags extracted. zero false positives. how pwnkit's agentic pipeline handles prompt injection, jailbreaks, SSRF, and multi-turn escalation.
-
why i built blind verification
every security scanner drowns you in false positives. it took three approaches before one of them actually worked.
-
why i built pwnkit
from 7 CVEs and manual pentesting to autonomous AI agents that re-exploit every finding to kill false positives.
-
how ai agents found 7 CVEs in popular npm packages
three weeks of pointing Claude Opus at npm packages produced 73 findings and 7 published CVEs across packages with 40M+ weekly downloads. here's how the workflow actually works.
-
the age of agentic security
if AI agents can write 1,000 pull requests a week, AI agents should be testing 1,000 pull requests a week. the asymmetry is about to collapse.