how to read an ai pentest benchmark leaderboard
the public xbow benchmark has 39 challenges that no longer build on upstream because of image rot. every ai pentest score you've seen is on a patched substrate. here's what 'we scored 96% on xbow' actually means, and what to ask before you trust the number.
every ai pentest vendor on the market today cites a number on “the XBOW benchmark.” 92%, 94%, 96%, occasionally higher. you see it on landing pages, in investor decks, in press releases, in conference talks. the implication is always the same: here is a single shared ruler, and our ruler reads higher than theirs.
the ruler is not shared. we audited the benchmark this week, and what we found is the kind of thing that makes you re-read every leaderboard number you have ever trusted. the upstream xbow benchmark itself, the one everybody claims to be scored against, is not directly runnable today. 39 of its 104 challenges — 37.5% of the suite — will not even build on a clean machine, because the docker images and apt repositories they depend on have rotted out from under them.
this post walks through what we found, what it means, and the seven questions a buyer or a journalist should ask any vendor citing an xbow score before treating the number as meaningful. no vendor is named. this is about the methodology, not the scoreboard.
the thirty-nine broken challenges
we sparse-cloned xbow-engineering/validation-benchmarks on 2026-04-07 and ran it cold in ci. the build failed on exactly 39 of 104 challenges (run 24078039304). we then parsed every error string and grouped them. the failures cluster into eight specific causes, and every one of them is an upstream infrastructure rot issue — not a challenge-design issue, not a model issue, just time doing what time does to pinned docker tags and deprecated package repositories.
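the grouping step is mechanical. here is a minimal sketch of that kind of pass — the log files and both error signatures below are invented stand-ins for illustration, not strings from the actual runs:

```shell
# bucket build logs by error signature and count each bucket.
# the logs and the two signatures here are synthetic stand-ins.
rm -rf /tmp/build-logs && mkdir /tmp/build-logs
echo 'E: The repository "http://deb.debian.org/debian buster Release" no longer has a Release file.' > /tmp/build-logs/XBEN-004.log
echo 'E: The repository "http://deb.debian.org/debian buster Release" no longer has a Release file.' > /tmp/build-logs/XBEN-008.log
echo 'Bind for 0.0.0.0:8080 failed: port is already allocated' > /tmp/build-logs/XBEN-084.log
grep -h -o -E 'no longer has a Release file|port is already allocated' /tmp/build-logs/*.log \
  | sort | uniq -c | sort -rn
```

on real logs the signature list grows until every failure lands in exactly one bucket, which is what makes a clean taxonomy possible.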
the taxonomy, in descending order of blast radius:
- debian buster apt archive — python 2.7 family (23 challenges). the xss and classic-web challenges are built on `python:2.7.*-slim` base images whose apt sources point at `deb.debian.org/debian buster main`. debian moved buster to `archive.debian.org` in 2023, so `apt-get update` 404s and the build dies before a single line of challenge code is copied in. this one cause accounts for 59% of the rot by itself. one-line fix: rewrite `sources.list` to point at the archive. challenges affected: XBEN-004, 008, 010, 011, 012, 013, 015, 016, 017, 018, 046, 047, 048, 049, 050, 051, 059, 062, 064, 065, 070, 074, 087.
- debian buster apt archive — python 3.x family (7 challenges). same exact failure mode as cause 1, different debian release. `python:3.x-slim-buster` images all 404 on `apt-get update` for the same reason. same one-line fix. challenges affected: XBEN-007, 089, 090, 091, 093, 094, 100.
- debian buster apt archive — old php family (3 challenges). php 5.x and 7.1 base images on buster/stretch. same failure, same fix. challenges affected: XBEN-041, 067, 081.
- debian buster apt archive — httpd 2.4.49/2.4.50 (2 challenges). pinned-vulnerable apache image on a buster base. the vulnerability is the point of the challenge; the buster base is how it rots. challenges affected: XBEN-026, 031.
- debian buster apt archive — haproxy 2.0.5 (1 challenge). same story. challenges affected: XBEN-066.
- composer ≥2.8 refuses insecure package installs (1 challenge). the challenge depends on installing a known-vulnerable twig/twig release via composer. composer 2.8 added a `block-insecure` default that refuses the install. retagging to `composer:2.7` fixes it. challenges affected: XBEN-044.
- java cgroup v2 npe at container start (1 challenge). `JAVA_OPTS` reads cgroup memory metrics at jvm boot. modern docker uses cgroup v2, which shapes those files differently, and the jvm null-pointers during its own initialization. adding `-XX:-UseContainerSupport` to `JAVA_OPTS` works around it. challenges affected: XBEN-035.
- docker-compose fixed-port host binding collision (1 challenge). the compose file hard-codes a host port that another service on the same runner is already using, so the container never gets a published port and the benchmark harness declares it unreachable. converting the mapping to container-only (dynamic host port allocation) fixes it. challenges affected: XBEN-084.
the headline number you should remember: 36 of the 39 failures (92%) are the same bug in different clothing — an archived debian buster apt repo that upstream xbow has not yet rewritten. if upstream shipped a single sed one-liner rewriting `deb.debian.org/debian buster` to `archive.debian.org/debian buster` across every dockerfile, it would unblock 36 challenges in one commit. the remaining 3 failures (composer pin, java cgroup flag, compose port mapping) are one-line fixes each.
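for concreteness, this is the shape that one-liner could take — the dockerfile below is a synthetic stand-in, not a file from the upstream repo:

```shell
# synthetic demo dockerfile with the rotted buster apt source
cat > /tmp/demo.Dockerfile <<'EOF'
FROM python:2.7.18-slim
RUN echo "deb http://deb.debian.org/debian buster main" > /etc/apt/sources.list
EOF
# the fix: point buster at the archive. an upstream commit would apply this
# across the whole tree, e.g. with find + sed over every dockerfile.
sed -i 's|deb.debian.org/debian buster|archive.debian.org/debian buster|g' /tmp/demo.Dockerfile
grep archive.debian.org /tmp/demo.Dockerfile
```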
one correction to a framing we used internally and in earlier drafts of this post: we used to say “four causes, one of them being phantomjs arm64.” phantomjs is not a separate failure mode. every phantomjs-affected challenge in the suite is also a python 2.7 buster challenge, because the dockerfile installs phantomjs via `apt-get install phantomjs` — the apt index 404s before phantomjs is ever reached. fixing cause 1 unblocks every phantomjs-apparent challenge as a side effect. the real surface area is the buster archive, not phantomjs.
the point is not that upstream xbow is bad. xbow is a fantastic benchmark and the work behind it is real. the point is that every pinned-tag docker benchmark rots this way eventually, and the honest thing to do about it is to patch the rot, document the patches, and publish the delta. which is where the story gets interesting.
the three-substrate picture
when you actually try to run “the xbow benchmark” as an ai pentest vendor, you end up on one of three substrates. every published score on the internet today lives in one of these buckets, whether the vendor discloses it or not.
| substrate | what it is | what it changes vs upstream | what it does not change |
|---|---|---|---|
| strict upstream | xbow-engineering/validation-benchmarks at HEAD, run cold | nothing | everything (including: 39 challenges that will not build) |
| community-patched | a public fork whose only commits are dockerfile fixes (retag rotted images, rewrite archived apt sources, swap phantomjs out where possible) | dockerfiles only | challenge source code, hints, filepaths, variable names, exploitability — all identical |
| self-owned fork | a private or semi-public fork maintained by a vendor, which typically also modifies challenge source: strips identifier comments, renames variables, rewrites hints, sometimes rewrites dockerfiles beyond what rot requires | dockerfiles and source | depends on the fork — has to be audited file by file |
pwnkit runs on the second row. specifically we run on 0ca/xbow-validation-benchmarks-patched, we pin the commit, and we publish the commit. the switch is documented in our own repo at commit baed2aa, and the commit message lists all four rot categories and the number of challenges each one unblocks.
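the row-two property (dockerfiles changed, challenge source untouched) is mechanically checkable. here is a toy sketch of the audit using a synthetic repo; against the real forks you would run the same diff between the two pinned shas:

```shell
# toy repo stands in for upstream vs a community-patched fork
set -e
rm -rf /tmp/fork-audit && mkdir -p /tmp/fork-audit/XBEN-004 && cd /tmp/fork-audit
git init -q
git config user.email demo@example.com && git config user.name demo
echo 'FROM python:2.7-slim' > XBEN-004/Dockerfile
echo 'print("challenge source")' > XBEN-004/app.py
git add -A && git commit -qm 'upstream snapshot' && git tag upstream
# the "fork" commit: retag the rotted base image, touch nothing else
echo 'FROM python:2.7-slim-buster' > XBEN-004/Dockerfile
git commit -qam 'patch rotted base image'
# any path surviving this filter means the fork also changed challenge source
git diff --name-only upstream HEAD | grep -v Dockerfile || echo 'dockerfile-only delta'
```

a self-owned fork (row three) fails this filter immediately, which is why it has to be audited file by file.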
the three substrates produce different numbers on the same model, with the same retry protocol and the same feature flags. not slightly different — meaningfully different, because the 39 unbuildable upstream challenges are the difference between a score with a denominator of 104 and a score with a denominator of 65. any vendor comparing its own score to a competitor’s score without naming the substrate is comparing rulers of different lengths.
we ran three ci runs against the three substrates — strict upstream, community-patched, and the most visible competitor fork — using an identical pwnkit binary, model, and turn cap. one finished cleanly in 5h30m. two timed out at the 330-minute github actions job cap because azure openai 400 retries chewed wall-time mid-sweep. the strict-upstream number is published below; the other two are re-running on a chunked workflow that splits the 104 challenges across parallel shards so neither hits the wall-clock cap again.
- strict upstream (`xbow-engineering/validation-benchmarks`): 45 / 104 = 43.3% flag extraction over the full denominator. 45 / 65 = 69.2% over the buildable subset. 39 of 104 challenges fail to build cold — the rot story is empirically confirmed. (run 24078039304)
- community-patched (`0ca/xbow-validation-benchmarks-patched`): 91 / 104 = 87.5% as a best-of-N aggregate across 74 runs of varying feature configurations, including a targeted confirmation pass on the challenges solved on strict but not yet targeted on patched. 13 challenges remain unsolved on this substrate across every run we’ve dispatched. (aggregate run set — most recent confirmation pass: 24121459730)
- competitor fork (`KeygraphHQ/xbow-validation-benchmarks`): timed out at 330 min on the single-job sequential run, ~67 challenges processed before the cap. re-running on chunked jobs — the number will land here when it does. (run 24078043162)
the headline finding of this post is the strict-upstream result, and it lands exactly the way the rot story predicted. when you run the agent against a denominator that includes 39 unbuildable challenges, you get 43.3%. when you run it against the 65 challenges that actually start, you get 69.2%. neither number is wrong — they answer different questions. 43.3% is “what does pwnkit score on the benchmark as published today, without any substrate hygiene?” 69.2% is “what does pwnkit score on the buildable part of the benchmark, ignoring infrastructure rot?” and 87.5% is “what does pwnkit score as a best-of-N aggregate on the community-patched substrate, where every challenge builds?”. three different substrates, three different questions, three different numbers. all from the same agent, the same model, the same turn cap.
if a vendor cites a score on “the xbow benchmark” without telling you which of these numbers it is, the score is meaningless.
the single-shot vs best-of-N question
substrate is half the problem. the other half is how many times the vendor rolled the dice.
xbow’s protocol is best-of-N: run the challenge up to N times, count a flag as solved if any one of the attempts finds it. N is a configurable parameter. a vendor publishing a best-of-N number without disclosing N is publishing a number you cannot interpret — best-of-1 and best-of-20 on the same underlying per-attempt success rate are wildly different numbers, and the gap grows with the marginal difficulty of the challenge.
we learned this the hard way on our own suite, on the same day we learned the upstream-rot story. we had a single run on XBEN-061 where a particular feature combination solved the challenge in 8 turns. we wrote that up as a directional signal. the next afternoon we ran the exact same combination against the exact same challenge on the exact same model. it failed in 10 turns. zero findings. the single v1 solve was a lucky roll, not a signal.
that is a regression test we caught and published — you can read the full post, the unsolved nine, from the same day. the lesson is hard and it applies to every benchmark number on the internet: a single solve is an anecdote. per-attempt success rate is the honest measure. a 40% per-attempt rate on a hard challenge will look like roughly 99% on best-of-10 and essentially 100% on best-of-20, and those are all the same underlying model doing the same underlying work with different amounts of retry budget stapled on top.
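under the (admittedly idealized) assumption that attempts are independent with per-attempt success rate $p$, the retry arithmetic is worth writing down:

```latex
% best-of-N solve probability for independent attempts with per-attempt rate p
P_{\text{best-of-}N} = 1 - (1 - p)^{N}
% p = 0.4:  N = 10 gives 1 - 0.6^{10} \approx 0.994;  N = 20 gives 1 - 0.6^{20} \approx 0.99996
```

the same engine, three very different headline numbers, depending only on the retry budget N.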
the fix is to report both. single-shot: here is the per-attempt success rate, with a confidence interval from n=5 or n=10 runs. best-of-N: here is the aggregate, and here is the N. we are shipping the n=10 protocol on our own suite as a direct consequence of the regression test, and we plan to report both columns side by side when it lands.
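the interval itself is unspecified above; one reasonable choice at n=5 or n=10 (our editorial assumption, not a protocol pwnkit has committed to) is the wilson score interval, which stays sensible at small n and at 0/n or n/n outcomes:

```latex
% wilson score interval for k successes in n attempts (z = 1.96 for 95%)
\hat{p} = \frac{k}{n},
\qquad
p_{\pm} = \frac{\hat{p} + \frac{z^2}{2n} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}
```

at k = 4, n = 10 this gives roughly [0.17, 0.69], which is exactly why a 4/10 sweep should ship with its interval attached rather than as a bare 40%.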
what we already tried on the hard set
while the definitive three-substrate trio was running, we also ran a smaller, targeted set of sweeps against the resistant slice of the benchmark — the challenges that had survived everything we had previously thrown at them. this is the kind of data that looks boring in aggregate but tells you what the engine actually does and does not do when pushed against its current failure mode.
the unsolved-nineteen, three modes, zero flags. on 2026-04-06 we dispatched three sibling runs against a 19-challenge subset: five anchor challenges (XBEN-001..005 as a sanity check) plus the fourteen challenges that had not yielded a flag across the earlier runs in our own history. same pwnkit commit, same model, same turn cap, three different configurations: white-box-all, black-box-all, white-box-experimental. the substrate was 0ca/xbow-validation-benchmarks-patched throughout.
all three configurations scored zero flags out of the nineteen. not almost zero. zero. across white-box and black-box, across the default feature stack and the experimental one, on the same slice of the benchmark, the resistant challenges stayed resistant. run IDs: 24018372657, 24018373143, 24018373633.
this is a strong negative finding and it lines up with what we wrote in the unsolved nine on the same day: once you get past the easy and medium portion of xbow, the marginal flag gets expensive, and the marginal flag after that is essentially a coin flip whose expectation depends on turn count and model temperature, not feature flags. flipping the mode doesn’t move the needle. flipping the feature profile doesn’t move the needle. the resistant subset is resistant because the challenges themselves are hard, not because the engine is misconfigured.
the feature-flag moat ablation. on the same day, on a 14-challenge version of that same resistant slice, we swept the feature-profile space more carefully. eight separate runs, one per profile, single-attempt each:
| profile | configuration meaning | score on the 14 |
|---|---|---|
| w-b-none | white-box, no feature flags (the true baseline) | 4 / 14 |
| w-b-none (retry) | same configuration, different rng | 3 / 14 |
| w-b-experimental | white-box, experimental flags on | 3 / 14 |
| w-b-no-triage | white-box, 11-layer triage disabled | 2 / 14 |
| w-b-all | white-box, every default flag on | 2 / 14 |
| w-b-all (first attempt) | same, earlier dispatch | 1 / 14 |
| b-b-all | black-box, every flag on | 0 / 14 |
| w-b-moat | white-box, ‘moat’ (v0.6.0 FP-reduction) layers on | 0 / 14 |
| w-b-moat-only | only the moat layers, nothing else | 0 / 14 |
run IDs: 24021443563, 24022989979, 24022990816, 24022991529, 24022992439, 24030583208, 24030583781, 24030584391, 24030584892.
the headline from the ablation is uncomfortable: on the hard set, the fp-moat layers score zero. the v0.6.0 moat was built specifically to kill false positives on the easy and medium parts of the benchmark — povGate, reachabilityGate, multiModal, debate, triageMemories, egats, consensus. those layers do their job on easy flags: they stop the engine from shipping things that don’t reproduce. but on the hard subset, they don’t just fail to help — they actively prune true positives that the baseline profile would have kept. two independent dispatches of the moat-only profile, 0/14 each. the plain baseline outscores every moat variant.
the caveat is crucial and we want to say it out loud: N=1 per cell. fourteen challenges, one attempt each per profile. this is directional at best. the same data at n=10 per cell is the statistical analysis we should be publishing instead of this one. we aren’t yet, because we haven’t run it — every run in pwnkit’s public xbow history is single-attempt. that is a gap we are closing with the next sweep, and it is documented as an open task in the engine repo. but the fact that the moat profile scored zero on two independent single-attempt sweeps, while the baseline scored 3-4 each time, is enough signal to say: the fp moat does not translate to the hard subset, and anyone comparing “pwnkit with the moat on” vs “pwnkit without the moat” on the hard set should not expect the moat to help.
the cold-build corroboration
one more thing from the history worth pulling out. before we ever ran the definitive three-substrate trio, we had two early strict-upstream sweeps on smaller prefixes of the benchmark. these were bring-up runs from 2026-04-04, before the workflow had stabilized. we forgot about them. looking back now, they independently corroborate the rot rate we eventually measured at full scale:
- white-box, first 30 challenges, strict upstream: 14 flags / 30 = 46.7%. 12 of 30 (40%) failed to build cold. (run 23986486081)
- black-box, first 50 challenges, strict upstream: 20 flags / 50 = 40%. 21 of 50 (42%) failed to build cold. (run 23987702375)
the build-failure rates are 40% and 42% on two independent prefixes of the same substrate, which is consistent with the 37.5% we measured at full 104. the small variance is because the rot is not uniformly distributed: the python 2.7 cluster skews toward the early challenge IDs and the python 3.x-buster cluster skews toward the later ones, so a 30-challenge prefix sees more of the first cluster and a 50-challenge prefix starts picking up the second.
the point is that this is not a one-time measurement we can wave away. every time we’ve run strict upstream at any prefix length, at any point in the last four days, the build-failure rate has been 40 ± 3%. the rot is real, it is stable, and it is reproducible from a clean clone.
the disclosure checklist
the upshot of all of this is a short list of questions a buyer or a journalist should ask any ai pentest vendor citing an xbow score, before treating the number as meaningful. none of them are gotchas. all of them have concrete, one-sentence answers if the vendor is being straight with you.
1. which substrate was this run on? strict upstream, a public community-patched fork, a self-owned fork, or a cherry-picked subset?
2. which fork commit? pin the sha so the reader can `git clone` it and audit the delta themselves.
3. was this single-shot or best-of-N? if best-of-N, what was N?
4. what is the per-attempt success rate, with a confidence interval? this is the single most honest number in the whole exercise.
5. which model? which version? which turn cap? a 30-turn cap and a 200-turn cap on the same model produce completely different scores.
6. which feature flags, playbooks, or tool stacks were enabled? was it vanilla, or was a challenge-specific playbook allowed to run?
7. did any challenges silently fail to build, and were they counted as failures or dropped from the denominator? this is the upstream-rot question made explicit. if the denominator is less than 104, say so.
if a vendor cannot or will not answer these — or worse, has never been asked — that is a signal about how the number was produced, not just about how it was reported.
where pwnkit stands
we publish the substrate: 0ca/xbow-validation-benchmarks-patched, pinned to a commit listed in our ci. we publish the fork commit that made the switch: baed2aa on the pwnkit oss repo, 2026-04-04, with all four rot categories itemized in the commit message. we publish the model and the model version. we publish the per-challenge turn cap. we publish the feature stack. we publish both best-of-N and, shortly, per-attempt success rate with a confidence interval from n=10 runs. we will publish pwnkit’s score on all three substrates — strict upstream, community-patched, and the competitor fork — as soon as the in-flight runs finish, and we will explain which challenges moved between the three columns.
we are also publishing one thing no substrate patch has fixed: XBEN-099. the community-patched fork claims “all 104 buildable,” which is true at the docker layer — the upstream dockerfile is FROM node:21, which pulls cleanly on any modern machine. the failure we see is at runtime, in the app, not in the image. we have not yet root-caused it. we are not dropping it from the denominator. we are not pretending it passes on best-of-N. it is reported as a failure on our scoreboard, and if we cannot fix it, it stays a failure. an upstream issue is {TBD: link once filed}.
the point of all of this is not to win the leaderboard. the point is that the leaderboard is only useful — to a buyer, to a journalist, to the field — if you know what was run on what. we publish what we run on. the rest of the field should be held to the same bar, and we would rather be the boring vendor that tells you exactly what substrate its number came from than the exciting vendor whose number evaporates the first time you try to reproduce it.
if you are evaluating ai pentest vendors this quarter, take the seven questions above and ask them. ask us too. the answers should be in the response, not in a follow-up call.