Build Recap ⏱️ 8 min read

We Threw Out 132,000 AI Skills

TL;DR

💥 What we did: We generated 132,328 AI skills with a deterministic factory, then declared the whole library unusable and kept only its taxonomy.

🔎 Why: A skill's description is what makes Claude fire it. Quantity was the wrong axis — triggering, not body prose, is the real quality ceiling.

🛡️ The new bar: A skill is admitted only if it passes a lint, a triggering eval, and a fact-check, AND a 4-judge panel scores it a mean of 9 out of 10 or higher.

🎯 The punchline: The gate rejected its own first exemplar for citing a made-up number. The corrected version passed at a mean of 9.0.

The result: A new skills-vault, filled one perfect skill at a time. We'd rather build 5 perfect skills than 500 good ones.

📊 Build Stats

This is the receipt for the hardened run that admitted the first vault skill. Slow and expensive on purpose — the numbers are the point, not a brag.

| Metric | Value |
| --- | --- |
| Skill built | email-authentication (SPF, DKIM, DMARC) |
| Agents spawned | 7 |
| Subagent tokens | 582,887 |
| Tool calls | 153 |
| Wall-clock time | ~17 minutes |
| Attempts to admit | 1 (first try) |
| Adversarial panel mean | 9.0 / 10 |
| Verdict | Admitted to the vault |

The headline run isn't the first run. The original verify pass took 14 agents, 1,161,109 subagent tokens, 326 tool calls and ~40 minutes across two attempts — because attempt one got rejected. The hardened re-run above passed clean. We'll get to why.

🗑️ Why 132,328 skills were the wrong answer

We built a deterministic factory and it generated 132,328 skills: 131,827 machine-generated plus 501 hand-curated. Then we declared the whole library unusable.

Not corrupt. Not broken. Just the wrong thing to optimize. We had been racing on quantity, and quantity is an axis that doesn't matter.

The one piece worth keeping was the map. The factory had ranked the whole space into a taxonomy, and that survived the cull.

[Figure: 132,328 generated skills, discarded as unusable. Kept: the taxonomy only, with 33 disciplines and 6,656 subdomains, ranked into a vault queue.]
132,328 skills in, one map out. The taxonomy ranked 6,656 high-value subdomains into the queue the vault now pulls from.

Core insight: A library of 132,000 skills is worth less than a library of one, if the 132,000 never fire correctly. The count was a vanity metric.

🧠 The finding that reframed everything

Here is the research finding that flipped the whole project. A skill's description is its activation mechanism, not its body.

At startup, Claude only loads each skill's name and description — roughly 100 tokens each. It never reads the bodies until it has already decided which skill to fire.

So the choice of which skill activates is pure inference over those short descriptions. The body prose you slaved over is invisible at the moment that matters.
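To make the split concrete, here is a minimal sketch of what "only the name and description are loaded" could look like. It assumes each skill file opens with a `---` delimited metadata block holding `name` and `description` fields. That layout, the `SkillCard` class, the `read_card` helper, and the example description text are all illustrative assumptions, not the loader Claude actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillCard:
    """The only part of a skill that is visible while one is being chosen."""
    name: str
    description: str


def read_card(skill_text: str) -> SkillCard:
    """Parse just the leading '---' metadata block; the body is never inspected."""
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("skill file must start with a '---' metadata block")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":  # end of metadata: stop before the body
            break
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return SkillCard(fields["name"], fields["description"])


# Hypothetical skill file: the description wording is invented for illustration.
EXAMPLE = """---
name: email-authentication
description: Use when diagnosing SPF, DKIM, or DMARC failures for a sending domain.
---
# Body
Hundreds of lines of routing instructions would go here.
"""

card = read_card(EXAMPLE)
print(card.name)
print(card.description)
```

Whatever sits below the second `---` never reaches the card, which is the point: at choice time the description has to do all the work.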

[Figure: At startup, every skill's name and description (~100 tokens) is loaded, and this picks the winner. The skill body (500+ lines) is not loaded until the skill is chosen, so it is invisible at choice time. Selection is inference over descriptions only, so triggering is the quality ceiling. The old factory polished the wrong half.]
Vercel measured skills never getting invoked in 56% of cases. Directive, specific descriptions activate far more reliably than vague ones.

Vercel measured this in the wild: skills sat unused in 56% of cases where they should have fired. The descriptions weren't directive enough to win the inference.

Our old factory had been polishing the body — the exact half of the skill that doesn't decide whether it ever runs.

Think of a skill's description like the label on a fire extinguisher. In an emergency nobody reads the instruction booklet inside — they read the one-line label and grab it or skip it. If the label is vague, the best extinguisher in the building never gets used.

📐 The gold standard and the gate

So we rewrote what "good" means. A skill now has to be a lean router, not an essay.

Lean router body

Under 500 lines. The body points; it doesn't lecture. References go one level deep, no deeper.

Bundled script

If the work is deterministic, ship code that does it. Don't make the model re-derive what a script can compute.

Worked example

One example, proven on real verified numbers — not a plausible-looking reconstruction.
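The two mechanical rules here, body length and reference depth, are easy to check in code. The sketch below is one way a lint might do it, not the lint we actually run. It assumes references are written as markdown-style links and that the referenced files are available as a dict of file name to text; `lint`, `LINK`, and the file names are hypothetical.

```python
import re

MAX_BODY_LINES = 500  # from the rule above: the body stays under 500 lines
# Assumption: references appear as markdown-style links, e.g. [label](file.md)
LINK = re.compile(r"\[[^\]]*\]\(([^)]+)\)")


def lint(body: str, reference_files: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the lint passes."""
    problems = []
    n = len(body.splitlines())
    if n >= MAX_BODY_LINES:
        problems.append(f"body has {n} lines, limit is under {MAX_BODY_LINES}")
    for target in LINK.findall(body):
        ref_text = reference_files.get(target)
        if ref_text is None:
            problems.append(f"missing reference file: {target}")
        elif LINK.search(ref_text):
            # A reference that links onward would be two levels deep.
            problems.append(f"{target} links to further references")
    return problems


body = "See [SPF limits](spf.md) for details.\n"
clean = lint(body, {"spf.md": "Plain notes, no links."})
nested = lint(body, {"spf.md": "More in [deep](deep.md)."})
print(clean)
print(nested)
```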

Then we put a hard verification gate in front of every candidate. Four checks, and all of them have to pass.

[1] Lint — structure, length, reference depth

[2] Triggering eval — does it fire when it should, and stay quiet when it shouldn't

[3] Fact-check — every claim verified; one fatal error is a reject

[4] Adversarial panel — 4 judges, each 1 to 10, mean must be ≥ 9

Core insight: A skill that scores 8 is a reject, not a near-miss. The gate has no "close enough" band: the lint must pass, triggering must pass, the fact-check must find no fatal error, and the panel mean must clear 9.
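The admission rule is simple enough to write down as code. This is a sketch of the decision logic only, under the assumption that each stage reports a plain pass/fail or a score; `GateInput` and `admit` are hypothetical names, and the individual judge scores in the usage lines are invented to produce panel means of 8.75 and 9.0.

```python
from dataclasses import dataclass
from statistics import mean

PANEL_BAR = 9.0  # from the gate: panel mean must be >= 9


@dataclass
class GateInput:
    lint_ok: bool
    triggering_ok: bool
    fatal_fact_errors: int
    panel_scores: list[float]  # one score per judge, each 1 to 10


def admit(g: GateInput) -> tuple[bool, str]:
    """Apply the four checks in order and report the first one that fails."""
    if not g.lint_ok:
        return False, "lint"
    if not g.triggering_ok:
        return False, "triggering eval"
    if g.fatal_fact_errors > 0:
        return False, "fact-check"
    if len(g.panel_scores) != 4:
        return False, "panel needs exactly 4 judges"
    panel_mean = mean(g.panel_scores)
    if panel_mean < PANEL_BAR:
        return False, f"panel mean {panel_mean:.2f} below {PANEL_BAR}"
    return True, "admitted"


# Hypothetical per-judge scores; only the resulting means matter here.
near_miss = admit(GateInput(True, True, 0, [9.0, 9.0, 9.0, 8.0]))
passing = admit(GateInput(True, True, 0, [9.0, 9.0, 9.0, 9.0]))
print(near_miss)
print(passing)
```

Note that there is no branch for "almost": a mean of 8.75 takes the same path as a mean of 5.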

🎭 The gate rejected its own first exemplar

Here is the part that proves the gate is real. The very first skill we ran through it — email authentication, the SPF/DKIM/DMARC checks that decide whether your email lands or bounces — got rejected.

It had a perfect trigger and a real bundled DNS validator. The triggering eval scored a perfect recall of 1.0 and a perfect near-miss rejection of 1.0.
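For readers who want the two triggering numbers pinned down, here is one plausible way to compute them. The definitions are assumptions: recall is the share of should-fire prompts where the skill fired, and near-miss rejection is the share of should-not-fire prompts where it stayed quiet. `triggering_scores` and the test prompts are hypothetical.

```python
def triggering_scores(results: list[tuple[bool, bool]]) -> tuple[float, float]:
    """results holds one (should_fire, did_fire) pair per test prompt.

    Returns (recall, near_miss_rejection).
    """
    positives = [fired for should, fired in results if should]
    near_misses = [fired for should, fired in results if not should]
    if not positives or not near_misses:
        raise ValueError("need at least one prompt of each kind")
    recall = sum(positives) / len(positives)
    rejection = sum(not fired for fired in near_misses) / len(near_misses)
    return recall, rejection


# Invented outcomes: a clean sweep, then a run with one miss on each side.
perfect = triggering_scores([(True, True), (True, True), (False, False), (False, False)])
flawed = triggering_scores([(True, True), (True, False), (False, False), (False, True)])
print(perfect)
print(flawed)
```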

Then one judge did something a rubber-stamp inspector never would: it ran the bundled validator against live DNS. And it caught a lie.

The worked example claimed github.com's SPF record sat at "8 of 10 DNS lookups with headroom." The judge checked. The real recursive count is already 10: right at the failure ceiling, no headroom at all. A confident, plausible, completely wrong number.

Stage four is where it died. Mean 8.75, below the bar, rejected — even though three of the four checks were clean and the author was the team itself.

It's like a quality inspector who fails the prototype the boss personally hand-built. That's the moment you know the inspection is real and not theater. A gate that would wave through your own work isn't a gate — it's a rubber stamp. The bouncer that would turn away the owner is the only bouncer worth having.

🔧 The fix, and the law it forced

The revision loop fed the exact error back to the author: this number is wrong, here's the live count. The author fixed it, and only the corrected version — mean 9.0 — was admitted.

Then we traced the root cause. The worked example hadn't been pasted from a real run; it had been hand-reconstructed from memory. That's how the made-up number got in.

So we added one law to the gold standard.

Do

If a skill ships a script, its worked example must be that script's exact verbatim output from an actual run. Paste it, don't retype it.

Don't

Don't reconstruct an example by hand. A plausible number written from memory is exactly how a confident, wrong figure slips past every reviewer who isn't running the code.
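This law can be enforced mechanically rather than on trust. The sketch below re-runs a script and compares its output to the documented example. It assumes the bundled script is a Python program that prints its result to stdout; `example_is_verbatim` is a hypothetical helper, and the one-line `-c` program stands in for a real bundled validator.

```python
import subprocess
import sys


def example_is_verbatim(script_args: list[str], documented_output: str) -> bool:
    """Run the script and compare its stdout to the documented example."""
    run = subprocess.run(
        [sys.executable, *script_args],
        capture_output=True, text=True, check=True,
    )
    # Only trailing whitespace is tolerated; every digit must match.
    return run.stdout.rstrip() == documented_output.rstrip()


# Stand-in script: a one-liner instead of a real bundled validator.
stand_in = ["-c", "print('lookups: 10 of 10')"]
pasted = example_is_verbatim(stand_in, "lookups: 10 of 10")
retyped = example_is_verbatim(stand_in, "lookups: 8 of 10")
print(pasted, retyped)
```

A script that queries live data can legitimately change its output over time, so a check like this is most useful at authoring time, when the example is first pasted in.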

We re-ran with that law in place. The next version passed on the first attempt — the hardened run from the stats box at the top.

Better still, it turned the old mistake into the skill's best teaching moment. The corrected example now explains why 8 visible include mechanisms recursively expand to 10 lookups, with a second live example to prove it.
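The gap between visible mechanisms and real lookups comes from recursion: every `include` costs one lookup itself, plus whatever the included record costs. RFC 7208 counts `include`, `a`, `mx`, `ptr`, `exists`, and `redirect` toward a limit of 10. The sketch below shows the counting on an in-memory set of records. It is not our bundled validator, it makes no DNS queries, and the `example.com` records are invented for illustration.

```python
LOOKUP_LIMIT = 10  # RFC 7208 cap on DNS-querying terms per SPF evaluation


def count_lookups(domain: str, records: dict[str, str], stack: tuple = ()) -> int:
    """Count DNS-querying terms for domain, following include and redirect."""
    if domain in stack:
        raise ValueError(f"include loop at {domain}")
    total = 0
    for term in records[domain].split()[1:]:  # skip the "v=spf1" tag
        term = term.lstrip("+-~?")  # drop the qualifier
        name = term.split(":")[0].split("/")[0]
        if term.startswith("include:"):
            target = term[len("include:"):]
            total += 1 + count_lookups(target, records, stack + (domain,))
        elif term.startswith("redirect="):
            target = term[len("redirect="):]
            total += 1 + count_lookups(target, records, stack + (domain,))
        elif name in ("a", "mx", "ptr", "exists"):
            total += 1  # ip4, ip6 and all cost nothing
    return total


# Hypothetical zone data: three visible terms at the top level.
RECORDS = {
    "example.com": "v=spf1 include:_spf.mail.example include:_spf.crm.example mx -all",
    "_spf.mail.example": "v=spf1 include:_a.mail.example include:_b.mail.example ~all",
    "_a.mail.example": "v=spf1 ip4:192.0.2.0/24 ~all",
    "_b.mail.example": "v=spf1 ip4:198.51.100.0/24 ~all",
    "_spf.crm.example": "v=spf1 ip4:203.0.113.0/24 ~all",
}

total = count_lookups("example.com", RECORDS)
print(f"{total} of {LOOKUP_LIMIT} lookups")
```

In this toy zone the top-level record shows three counted terms, but the nested includes bring the total to five. Reading only the top-level record gives the wrong number, which is exactly how a hand-written count goes wrong.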

📈 What max effort cost — the honest trade-off

This is the opposite of the 132k sweep, and the bill says so. Both runs admitted exactly one skill.

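| Run | Agents | Subagent tokens | Tool calls | Time | Attempts | Outcome |
| --- | --- | --- | --- | --- | --- | --- |
| Original verify | 14 | 1,161,109 | 326 | ~40 min | 2 | 8.75 reject → 9.0 admit |
| Hardened re-run | 7 | 582,887 | 153 | ~17 min | 1 | 9.0, first try |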

So the real cost of one vault skill is roughly 17 to 40 minutes and 0.6 to 1.2 million subagent tokens. The vault grows only a handful of skills a day.

That's slow and expensive by design. We don't have a verified dollar figure to quote, so we won't invent one — that would be the exact sin this whole post is about.

Core insight: The expense is the feature. A gate cheap enough to run on everything is a gate weak enough to pass everything. The integrity tax is what keeps the bar at 9.

🔐 The skills-vault: 9s and 10s only

That one admitted skill is now the first member of a new library: the skills-vault. It fills autonomously and continuously from the ROI-ranked queue the old taxonomy left behind.

One skill at a time. Each one passes triggering, survives the fact-check, and clears a panel mean of 9 — or it doesn't go in.

🗂️ ROI-ranked queue (6,656 subdomains)

✍️ Author one skill — lean router + bundled script + verbatim example

🛡️ The 4-stage gate — lint, triggering, fact-check, panel ≥ 9

✅ Admitted to the vault — or rejected and sent back

We would rather build 5 perfect skills than 500 good ones. The 132,000 taught us the ceiling on quantity. The vault is the bet on quality, and the gate is what makes the bet honest.

Run your work through a gate this strict

Godmode is the execution layer that verifies before it ships — the same discipline that rejected our own first skill. Start free with Godmode Lite, or see the full protocol.

More on how the gate got this strict: we shipped five skills nobody could install, the audit that found the bugs in our own auditor, the blind experiment that proved scoring works, and the time we caught Claude grading its own open-book exam.