We Threw Out 132,000 AI Skills
💥 What we did: We generated 132,328 AI skills with a deterministic factory, then declared the whole library unusable and kept only its taxonomy.
🔎 Why: A skill's description is what makes Claude fire it. Quantity was the wrong axis — triggering, not body prose, is the real quality ceiling.
🛡️ The new bar: A skill is admitted only if it passes a lint, a triggering eval, a fact-check, AND a 4-judge panel scores it a mean of 9 out of 10 or higher.
🎯 The punchline: The gate rejected its own first exemplar for citing a made-up number. The corrected version passed at a mean of 9.0.
✅ The result: A new skills-vault, filled one perfect skill at a time. We'd rather build 5 perfect skills than 500 good ones.
Build Stats
This is the receipt for the hardened run that admitted the first vault skill. Slow and expensive on purpose — the numbers are the point, not a brag.
| Metric | Value |
|---|---|
| Skill built | email-authentication (SPF, DKIM, DMARC) |
| Agents spawned | 7 |
| Subagent tokens | 582,887 |
| Tool calls | 153 |
| Wall-clock time | ~17 minutes |
| Attempts to admit | 1 (first try) |
| Adversarial panel mean | 9.0 / 10 |
| Verdict | Admitted to the vault |
The headline run isn't the first run. The original verify pass took 14 agents, 1,161,109 subagent tokens, 326 tool calls and ~40 minutes across two attempts — because attempt one got rejected. The hardened re-run above passed clean. We'll get to why.
Why 132,328 skills were the wrong answer
We built a deterministic factory and it generated 132,328 skills: 131,827 machine-generated plus 501 hand-curated. Then we declared the whole library unusable.
Not corrupt. Not broken. Just the wrong thing to optimize. We had been racing on quantity, and quantity is an axis that doesn't matter.
The one piece worth keeping was the map. The factory had ranked the whole space into a taxonomy, and that survived the cull.
Core insight: A library of 132,000 skills is worth less than a library of one, if the 132,000 never fire correctly. The count was a vanity metric.
The finding that reframed everything
Here is the research finding that flipped the whole project. A skill's description is its activation mechanism, not its body.
At startup, Claude loads only each skill's name and description, roughly 100 tokens each. It does not read a skill's body until it has already decided which skill to fire.
So the choice of which skill activates is pure inference over those short descriptions. The body prose you slaved over is invisible at the moment that matters.
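To make that concrete, here is a minimal sketch of what a startup index could look like. It assumes each skill is a file with a simple `key: value` frontmatter block between `---` lines. The function names, the file path, and the sample description are illustrative and are not Claude's actual loader.

```python
def read_frontmatter(skill_text: str) -> dict[str, str]:
    """Return the key: value pairs from the leading '---' block, nothing else."""
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter; the body is never parsed
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def build_index(skills: dict[str, str]) -> list[dict[str, str]]:
    """Keep only name and description for each skill (hypothetical index shape)."""
    index = []
    for text in skills.values():
        meta = read_frontmatter(text)
        index.append({
            "name": meta.get("name", ""),
            "description": meta.get("description", ""),
        })
    return index


# Illustrative skill file; the description wording is made up for this sketch.
SAMPLE = """---
name: email-authentication
description: Use when diagnosing SPF, DKIM, or DMARC failures for a sending domain.
---
# Body
Long instructions that play no part in the selection step.
"""

index = build_index({"email-authentication/SKILL.md": SAMPLE})
print(index)
```

Nothing below the closing `---` reaches the index, which is the point: whatever selects a skill only ever sees those two short fields.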
Vercel measured this in the wild: skills sat unused in 56% of cases where they should have fired. The descriptions weren't directive enough to win the inference.
Our old factory had been polishing the body — the exact half of the skill that doesn't decide whether it ever runs.
Think of a skill's description like the label on a fire extinguisher. In an emergency nobody reads the instruction booklet inside — they read the one-line label and grab it or skip it. If the label is vague, the best extinguisher in the building never gets used.
The gold standard and the gate
So we rewrote what "good" means. A skill now has to be a lean router, not an essay.
Lean router body
Under 500 lines. The body points; it doesn't lecture. References go one level deep, no deeper.
Bundled script
If the work is deterministic, ship code that does it. Don't make the model re-derive what a script can compute.
Worked example
One example, proven on real verified numbers — not a plausible-looking reconstruction.
Then we put a hard verification gate in front of every candidate. Four checks, and all of them have to pass.
[1] Lint — structure, length, reference depth
↓
[2] Triggering eval — does it fire when it should, and stay quiet when it shouldn't
↓
[3] Fact-check — every claim verified; one fatal error is a reject
↓
[4] Adversarial panel — 4 judges, each 1 to 10, mean must be ≥ 9
Core insight: A skill that scores 8 is a reject, not a near-miss. The gate has no "close enough" band — triggering must pass, the fact-check must find no fatal error, and the panel mean must clear 9.
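As a rough sketch, the admit-or-reject decision could be written like this. The `GateInput` fields and the `gate` function are hypothetical names chosen for illustration; only the stage order and the bar of 9 come from the description above. The individual judge scores in the usage lines are also invented, picked so that they average to 8.75.

```python
from dataclasses import dataclass
from statistics import mean

PANEL_BAR = 9.0  # panel mean must be at least this to admit


@dataclass
class GateInput:
    lint_ok: bool
    triggering_ok: bool
    fatal_fact_errors: int
    panel_scores: list[float]  # one score per judge, each from 1 to 10


def gate(candidate: GateInput) -> tuple[bool, str]:
    """Return (admitted, reason). Each stage is a hard stop, checked in order."""
    if not candidate.lint_ok:
        return False, "lint failed"
    if not candidate.triggering_ok:
        return False, "triggering eval failed"
    if candidate.fatal_fact_errors > 0:
        return False, "fatal fact-check error"
    panel_mean = mean(candidate.panel_scores)
    if panel_mean < PANEL_BAR:
        return False, f"panel mean {panel_mean:.2f} below {PANEL_BAR}"
    return True, f"panel mean {panel_mean:.2f}"


# Hypothetical judge scores: three 9s and one 8 average to 8.75.
print(gate(GateInput(True, True, 0, [9, 9, 9, 8])))
print(gate(GateInput(True, True, 0, [9, 9, 9, 9])))
```

There is deliberately no branch for "almost": a single 8 among three 9s is enough to pull the mean under the bar.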
The gate rejected its own first exemplar
Here is the part that proves the gate is real. The very first skill we ran through it — email authentication, the SPF/DKIM/DMARC checks that decide whether your email lands or bounces — got rejected.
It had a perfect trigger and a real bundled DNS validator. The triggering eval scored a perfect recall of 1.0 and a perfect near-miss rejection of 1.0.
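For readers who want the two numbers pinned down, here is one plausible way to compute them. The definitions are an assumption: recall is taken as the share of should-fire prompts where the skill fired, and near-miss rejection as the share of should-not-fire prompts where it stayed quiet. The test cases are invented for illustration.

```python
def triggering_metrics(cases: list[tuple[bool, bool]]) -> tuple[float, float]:
    """cases holds (should_fire, did_fire) pairs.

    Returns (recall, near_miss_rejection) under the assumed definitions above.
    """
    fired_when_expected = [did for should, did in cases if should]
    fired_when_not = [did for should, did in cases if not should]
    if not fired_when_expected or not fired_when_not:
        raise ValueError("need at least one positive and one near-miss case")
    recall = sum(fired_when_expected) / len(fired_when_expected)
    rejection = sum(not did for did in fired_when_not) / len(fired_when_not)
    return recall, rejection


# Invented eval set: three prompts that should fire, two near misses that should not.
perfect = [(True, True), (True, True), (True, True), (False, False), (False, False)]
print(triggering_metrics(perfect))  # (1.0, 1.0)

# Same set, but one missed trigger and one false fire.
flawed = [(True, True), (True, False), (True, True), (False, True), (False, False)]
print(triggering_metrics(flawed))
```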
Then one judge did something a rubber-stamp inspector never would: it ran the bundled validator against live DNS. And it caught a lie.
The worked example claimed github.com's SPF record sat at "8 of 10 DNS lookups with headroom." The judge checked. The real recursive count is already 10, right at the failure ceiling, with no headroom at all. A confident, plausible, completely wrong number.
Stage four is where it died. Mean 8.75, below the bar, rejected — even though three of the four checks were clean and the author was the team itself.
It's like a quality inspector who fails the prototype the boss personally hand-built. That's the moment you know the inspection is real and not theater. A gate that would wave through your own work isn't a gate — it's a rubber stamp. The bouncer that would turn away the owner is the only bouncer worth having.
The fix, and the law it forced
The revision loop fed the exact error back to the author: this number is wrong, here's the live count. The author fixed it, and only the corrected version — mean 9.0 — was admitted.
Then we traced the root cause. The worked example hadn't been pasted from a real run; it had been hand-reconstructed from memory. That's how the made-up number got in.
So we added one law to the gold standard.
Do
If a skill ships a script, its worked example must be that script's exact verbatim output from an actual run. Paste it, don't retype it.
Don't
Don't reconstruct an example by hand. A plausible number written from memory is exactly how a confident, wrong figure slips past every reviewer who isn't running the code.
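One way to enforce that rule mechanically is to re-run the bundled script and compare its output with the pasted example. The sketch below assumes the script is a Python file that prints its result to stdout; `example_matches_script` and the throwaway script it runs are hypothetical, written only for this illustration.

```python
import os
import subprocess
import sys
import tempfile


def example_matches_script(script_path: str, args: list[str], pasted_example: str) -> bool:
    """Run the bundled script and compare its stdout to the pasted example."""
    result = subprocess.run(
        [sys.executable, script_path, *args],
        capture_output=True,
        text=True,
        check=True,  # a crashing script should fail loudly, not compare as unequal
    )
    return result.stdout.strip() == pasted_example.strip()


# Stand-in for a bundled script; it prints a fixed line so the demo is repeatable.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "count_lookups.py")
    with open(script, "w") as fh:
        fh.write("print('lookups: 10')\n")
    verbatim_ok = example_matches_script(script, [], "lookups: 10")
    retyped_ok = example_matches_script(script, [], "lookups: 8")

print(verbatim_ok, retyped_ok)  # True False
```

A script that queries live data can legitimately print something different next month, so a check like this fits best at authoring time, when the example is pasted.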
We re-ran with that law in place. The next version passed on the first attempt — the hardened run from the stats box at the top.
Better still, it turned the old mistake into the skill's best teaching moment. The corrected example now explains why 8 visible include mechanisms recursively expand to 10 lookups, with a second live example to prove it.
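The gap between visible includes and real lookups is easy to reproduce on paper. The sketch below counts lookups recursively over a made-up set of records under `example.test`; these are not github.com's records and not the output of the bundled validator. It relies on the SPF rule in RFC 7208 that `include`, `a`, `mx`, `ptr`, `exists`, and `redirect` each cost one DNS lookup, with a limit of 10 per evaluation. The `seen` guard is a simplification that keeps the toy from looping.

```python
from typing import Optional


def count_lookups(domain: str, records: dict[str, str], seen: Optional[set] = None) -> int:
    """Count DNS-querying SPF terms for a domain, following includes and redirects."""
    seen = set() if seen is None else seen
    if domain in seen:  # simplification: do not re-expand a domain already visited
        return 0
    seen.add(domain)
    total = 0
    for term in records.get(domain, "").split()[1:]:  # skip the "v=spf1" tag
        term = term.lstrip("+-~?")  # drop the qualifier, if any
        if term.startswith("redirect="):
            total += 1 + count_lookups(term[len("redirect="):], records, seen)
            continue
        name, _, target = term.partition(":")
        name = name.split("/")[0]  # "mx/24" still counts as "mx"
        if name == "include":
            total += 1 + count_lookups(target, records, seen)
        elif name in ("a", "mx", "ptr", "exists"):
            total += 1
    return total


# Made-up records: 8 visible includes, two of which include one more record each.
RECORDS = {
    "example.test": "v=spf1 "
    + " ".join(f"include:s{i}.example.test" for i in range(1, 9))
    + " ~all",
    "s1.example.test": "v=spf1 include:n1.example.test ~all",
    "s2.example.test": "v=spf1 include:n2.example.test ~all",
    "n1.example.test": "v=spf1 ip4:198.51.100.0/24 ~all",
    "n2.example.test": "v=spf1 ip4:203.0.113.0/24 ~all",
}
for i in range(3, 9):
    RECORDS[f"s{i}.example.test"] = "v=spf1 ip4:192.0.2.0/24 ~all"

visible = RECORDS["example.test"].count("include:")
total = count_lookups("example.test", RECORDS)
print(visible, total)  # 8 10
```

Reading only the top-level record gives 8. Following every include gives 10, which is the kind of difference that a hand-written example gets wrong.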
What max effort cost — the honest trade-off
This is the opposite of the 132k sweep, and the bill says so. Both runs admitted exactly one skill.
| Run | Agents | Subagent tokens | Tool calls | Time | Attempts | Outcome |
|---|---|---|---|---|---|---|
| Original verify | 14 | 1,161,109 | 326 | ~40 min | 2 | 8.75 reject → 9.0 admit |
| Hardened re-run | 7 | 582,887 | 153 | ~17 min | 1 | 9.0, first try |
So the real cost of one vault skill is roughly 17 to 40 minutes and 0.6 to 1.2 million subagent tokens. The vault grows by only a handful of skills a day.
That's slow and expensive by design. We don't have a verified dollar figure to quote, so we won't invent one — that would be the exact sin this whole post is about.
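The rounded token range can be checked directly against the table. This is plain arithmetic on the two reported runs, with nothing assumed beyond the figures already shown.

```python
# (subagent tokens, approximate minutes) as reported in the table above
runs = {
    "original verify": (1_161_109, 40),
    "hardened re-run": (582_887, 17),
}

for name, (tokens, minutes) in runs.items():
    print(f"{name}: {tokens / 1e6:.1f}M tokens in ~{minutes} min")

# Share of the original run's tokens that the hardened re-run used.
ratio = runs["hardened re-run"][0] / runs["original verify"][0]
print(f"hardened / original tokens: {ratio:.0%}")
```

The hardened re-run used about half the tokens of the original, which matches it needing one attempt instead of two.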
Core insight: The expense is the feature. A gate cheap enough to run on everything is a gate weak enough to pass everything. The integrity tax is what keeps the bar at 9.
The skills-vault: 9s and 10s only
That one admitted skill is now the first member of a new library: the skills-vault. It fills autonomously and continuously from the ROI-ranked queue the old taxonomy left behind.
One skill at a time. Each one passes triggering, survives the fact-check, and clears a panel mean of 9 — or it doesn't go in.
✍️ Author one skill — lean router + bundled script + verbatim example
↓
🛡️ The 4-stage gate — lint, triggering, fact-check, panel ≥ 9
↓
✅ Admitted to the vault — or rejected and sent back
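The loop in that diagram, including the step where a rejection reason is fed back to the author, might be sketched as follows. `fill_one`, the `author` and `gate` callables, and the `max_attempts` cap are all hypothetical; the post does not say whether the real pipeline limits retries. The toy author and toy gate exist only to exercise the loop.

```python
from typing import Callable, Optional


def fill_one(
    author: Callable[[Optional[str]], str],
    gate: Callable[[str], tuple[bool, str]],
    max_attempts: int = 3,  # assumed cap, not taken from the post
) -> tuple[Optional[str], int]:
    """Author a draft, run the gate, and feed the rejection reason back until admitted."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        draft = author(feedback)
        admitted, reason = gate(draft)
        if admitted:
            return draft, attempt
        feedback = reason  # the exact error goes back to the author
    return None, max_attempts


def toy_author(feedback: Optional[str]) -> str:
    """First draft carries the wrong figure; any feedback triggers the corrected one."""
    return "10 of 10 lookups" if feedback else "8 of 10 lookups"


def toy_gate(draft: str) -> tuple[bool, str]:
    """Reject drafts that contain the figure known to be wrong."""
    if "8 of 10" in draft:
        return False, "lookup count is wrong; the live count is 10"
    return True, "ok"


print(fill_one(toy_author, toy_gate))  # ('10 of 10 lookups', 2)
```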
We would rather build 5 perfect skills than 500 good ones. The 132,000 taught us the ceiling on quantity. The vault is the bet on quality, and the gate is what makes the bet honest.
Run your work through a gate this strict
Godmode is the execution layer that verifies before it ships — the same discipline that rejected our own first skill. Start free with Godmode Lite, or see the full protocol.
More on how the gate got this strict: we shipped five skills nobody could install, the audit that found the bugs in our own auditor, the blind experiment that proved scoring works, and the time we caught Claude grading its own open-book exam.