The pwnkit blog
Field notes on AI pentesting, agentic security, the XBOW benchmark, and the vulnerabilities autonomous AI agents find when you point them at real code. Built and written by the team behind pwnkit, the leading open-source AI pentest agent.
-
deleting better-sqlite3 from pwnkit, and what it cost us
pwnkit 0.7.1 ships with zero native modules. the persistence layer was migrated from better-sqlite3 to a pure-wasm sqlite implementation. here's what broke, what we kept, and why every npx install on every node version now just works.
-
how to read an ai pentest benchmark leaderboard
the public xbow benchmark has 39 challenges that no longer build on upstream because of image rot. every ai pentest score you've seen is on a patched substrate. here's what 'we scored 96% on xbow' actually means, and what to ask before you trust the number.
-
the unsolved nine: one win, one honeypot, and a regression test that killed our hypothesis
an A/B sweep over the 9 challenges keeping pwnkit off 100% on XBOW. one new flag, one honeypot, and a same-day regression test that killed our 'lean scaffolding wins' hypothesis. why a single XBOW solve is an anecdote, not a benchmark.
-
introducing pwnkit cloud
an autonomous ai attacker on contract, pointed at your product. closed beta. by application only. founder-led from zürich.
-
pwnkit oss now outperforms commercial pentest teams
the open-source agent just crossed the line: it now finds more real bugs on the public xbow benchmark than the commercial pentest engagement you were about to pay for.
-
the attack surface XBOW and KinoSec don't test
traditional web vuln benchmarks miss the entire AI/LLM security attack surface. prompt injection, jailbreaks, MCP tool abuse -- none of it shows up in XBOW's 104 challenges.
-
we built benchmarks for everything pwnkit does
five benchmark suites across web pentesting, LLM security, LLM safety, npm auditing, and network pentesting. here's what we learned.
-
pwnkit v0.4: shell-first pentesting, 27 XBOW flags, and the bug that broke everything
rebuilding pwnkit's agent architecture from structured tools to shell-first, cracking 27 XBOW benchmark challenges, and hunting down the serialization bug that was crashing the agent after 3 turns.
-
why we gave our agent a terminal instead of tools
we built 10 structured tools for web pentesting. then we gave the agent just curl and it outperformed everything.
-
what we learned running pwnkit against 104 CTF challenges
29 flags, a serialization bug, a 770-line prompt that didn't help, and why the model matters more than the framework.
-
running pwnkit against the XBOW benchmark
XBOW has 104 Docker CTF challenges covering traditional web vulns. here's how pwnkit performs against it.
-
100% on our AI security benchmark
10 challenges. 10 flags extracted. zero false positives. how pwnkit's agentic pipeline handles prompt injection, jailbreaks, SSRF, and multi-turn escalation.
-
why i built blind verification
every security scanner drowns you in false positives. it took three approaches before one of them actually worked.
-
why i built pwnkit
from 7 CVEs and manual pentesting to autonomous AI agents that re-exploit every finding to kill false positives.
-
how ai agents found 7 CVEs in popular npm packages
three weeks of pointing Claude Opus at npm packages produced 73 findings and 7 published CVEs across packages with 40M+ weekly downloads. here's how the workflow actually works.
-
the age of agentic security
if AI agents can write 1,000 pull requests a week, AI agents should be testing 1,000 pull requests a week. the asymmetry is about to collapse.