Built by /blog-post-GM — a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.
Experiment ⏱️ 7 min read

We Split One Skill Into 14 Files. Then We Raced Them.

TL;DR

🧪 The experiment: One-Shot Beta as one big file vs decomposed into 14 modular scripts
🏁 The result: Modular won 3-0 across small, medium, and large tasks
💥 The surprise: The monolith never wrote unit tests. The modular version always did.

🤔 The Question Nobody Could Answer

One-Shot Beta is a 400-line execution protocol spread across 4 files. It works. It ships code that passes its own rubric 92% of the time.

But what if each phase lived in its own file? Would the AI follow instructions better when they arrive one script at a time — or worse?

Think of it like this: One approach is a textbook — everything in chapters, read front to back. The other is a binder of laminated checklists — pull the one you need, follow it, put it back. Same information, different delivery. Does the format change how well the student performs?

We had strong opinions on both sides. So we stopped arguing and ran the experiment.

⚙️ The Setup

Contestant A: The Monolith

/one-shot-beta — the production skill. Four files: SKILL.md (the orchestrator), execution-protocol.md, scoring-and-assessment.md, and delivery-format.md. All phases live together in one file.

Contestant B: The Modular Version

/one-shot-scripts — same protocol, decomposed into 14 standalone scripts. One file per phase, loaded on demand. The orchestrator is a slim routing document: “Read scripts/phase-2-build.md and execute.”

| Aspect | Monolith (A) | Modular (B) |
| --- | --- | --- |
| Total files | 4 | 15 |
| Phase loading | All in one file | One file per phase |
| File reads per run | ~4 | ~10–14 |
| Content identical? | Yes — word for word | Yes — word for word |

🔬 The Protocol (and the Round We Threw Out)

We actually ran 4 rounds. The first one got scrapped.

During Round 1’s second session, one of the skills found the safe-parse directory already written by the first session’s run — and used it. Instead of building from scratch, it read the existing code, “improved” it, and claimed credit. A contaminated test.

Lesson learned: A/B testing AI skills requires true isolation. If both sessions write to the same directory, the second one isn’t starting from zero — it’s getting a head start. We scrapped the round, cleaned the directory, and re-ran with separate output paths.

Rounds 1–3 below used isolated directories: safe-parse-a-monolith/ and safe-parse-b-modular/. No cross-contamination.

Each valid round followed the same process:

📋 Same prompt for both skills

🆕 Fresh session → run /one-shot-beta (A)

🆕 Fresh session → run /one-shot-scripts (B)

🆕 Fresh session → run /ab-grader on both outputs

Control variables: same prompt (word for word), fresh sessions (no prior context), same model (Opus 4.6), same rubric (8 dimensions, 0.03 significance threshold).


1️⃣ Round 1: Small — safe-parse Utility

The task: Build a Node.js module with 3 parse functions (safeJsonParse, safeIntParse, safeDateParse), a full test suite, and TypeScript definitions. No dependencies.

| Dimension | Monolith (A) | Modular (B) | Edge |
| --- | --- | --- | --- |
| Code Quality | 0.85 | 1.00 | B |
| Test Coverage | 0.85 | 1.00 | B |
| Security | 0.70 | 1.00 | B |
| Completeness | 0.85 | 1.00 | B |
| Process | 0.85 | 0.85 | = |
| Documentation | 0.70 | 0.85 | B |
| Polish | 0.85 | 1.00 | B |
| Decisions | 0.85 | 1.00 | B |
| Composite | 0.82 | 0.97 | B wins (+0.15) |

The gap: B produced 75 tests vs A’s 36. B had a 1MB JSON size limit, epoch range validation, structured ParseError with source tracking, and readonly on all result fields. A had none of that — including no size limit on a library called “safe-parse.”

Round 1 verdict: B wins. +0.15 margin driven by security (+0.30 gap) and double the test count. Not a coin flip — structural differences.

2️⃣ Round 2: Medium — Rate Limiter Middleware

The task: Express middleware with fixed window and sliding window algorithms, per-IP and per-API-key limiting, 429 responses with Retry-After headers, in-memory store, and concurrent request tests.

| Dimension | Monolith (A) | Modular (B) | Edge |
| --- | --- | --- | --- |
| Code Quality | 0.85 | 0.90 | B |
| Test Coverage | 0.70 | 0.90 | B |
| Security | 0.70 | 0.85 | B |
| Completeness | 0.85 | 0.85 | = |
| Process | 0.85 | 0.95 | B |
| Documentation | 0.85 | 0.85 | = |
| Polish | 0.85 | 0.85 | = |
| Decisions | 0.85 | 0.95 | B |
| Composite | 0.80 | 0.89 | B wins (+0.09) |

A’s edge: finished in about 60% of B’s time (4m37s vs 7m44s) and upgraded its mutex from spin-wait to a FIFO queue during hardening. Genuine runtime improvement.

B’s edge: 46 tests vs A’s 23. Constructor validation with RangeError (A silently accepted capacity=0). Dedicated adversarial test suite with null bytes, XSS payloads, and special characters. A had zero adversarial tests.
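The `RangeError` point is worth seeing in code. Here is a minimal fixed-window counter with the kind of constructor validation the grader credited B for, written as a sketch rather than B's actual middleware; the class name and `allow` method are invented for illustration.

```typescript
// Illustrative sketch — not Output B's actual middleware.
class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();

  constructor(private capacity: number, private windowMs: number) {
    // The validation A skipped: A silently accepted capacity=0,
    // which would deny every request forever.
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError(`capacity must be a positive integer, got ${capacity}`);
    }
    if (windowMs <= 0) {
      throw new RangeError(`windowMs must be positive, got ${windowMs}`);
    }
  }

  // Returns false when the key has exhausted its window;
  // the caller would then send 429 with a Retry-After header.
  allow(key: string, now: number = Date.now()): boolean {
    const entry = this.counts.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.counts.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (entry.count >= this.capacity) return false;
    entry.count++;
    return true;
  }
}
```

Failing loudly in the constructor turns a silent misconfiguration into a test-catchable bug, which is presumably why the grader scored it under Decisions as well as Security.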

New pattern emerging: Both skills self-scored ~0.10 above independent assessment. A self-scored 0.93 while actually scoring 0.80. B self-scored 0.96 while actually scoring 0.89. Neither is honest — but B is closer.

3️⃣ Round 3: Large — Markdown Link Checker CLI

The task: CLI tool that validates markdown files for broken links, checks image references, validates YAML frontmatter, supports glob patterns, has a --fix mode, and outputs in multiple formats. Full test suite with fixtures.

| Dimension | Monolith (A) | Modular (B) | Edge |
| --- | --- | --- | --- |
| Code Quality | 0.85 | 0.92 | B |
| Test Coverage | 0.25 | 0.95 | B |
| Security | 0.70 | 0.85 | B |
| Completeness | 0.92 | 0.85 | A |
| Process | 0.70 | 0.95 | B |
| Documentation | 0.92 | 0.85 | A |
| Polish | 0.85 | 0.90 | B |
| Decisions | 0.70 | 0.92 | B |
| Composite | 0.70 | 0.91 | B wins (+0.21) |

This round broke it open. Output A had zero unit tests. No test framework. No test files. No test script in package.json. Only manual CLI verification against fixtures. It self-scored testing at 0.88, claiming this was “acceptable for a CLI tool of this scope.”

Output B produced 81 tests across 6 suites — parser, checker, fixer, scanner, CLI args, and config loading — plus adversarial inputs for null bytes, deeply nested brackets, and 10,000-character URLs. All passing.
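Adversarial inputs of that kind are cheap to write once you decide to write them. A hedged sketch of the idea, using an invented `isSafeUrl` helper (not from either output) and the input classes named above:

```typescript
// Hypothetical example of the adversarial checks B's suites exercised.
// isSafeUrl is invented for illustration.
function isSafeUrl(url: string): boolean {
  if (url.length > 2048) return false;      // rejects the 10,000-character URLs
  if (url.includes("\u0000")) return false; // rejects null bytes
  return /^https?:\/\/\S+$/.test(url);      // minimal shape check, http(s) only
}
```

Each rejected class becomes one assertion in a test suite; A had zero of these, which is how a "complete" feature set ended up unverified.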

A built more features

3 fix modes (unlink/comment/remove), --json output, --init config generation, bare URL detection, rate-limit backoff. The most complete feature set of any round.

A couldn’t verify they worked

Zero automated tests. Self-scored 0.92 composite on a 0-test output. Actual composite: 0.70. A gap of +0.22 between self-assessment and reality.

📊 The Cumulative Scorecard

CUMULATIVE: Monolith 0 — 3 Modular

Round 1 (Small):  B wins +0.15  (0.82 vs 0.97)
Round 2 (Medium): B wins +0.09  (0.80 vs 0.89)
Round 3 (Large):  B wins +0.21  (0.70 vs 0.91)

Four patterns held across every round:

🧪

B Always Tests

75, 46, and 81 tests across 3 rounds. A produced 36, 23, and 0. The modular skill never skipped testing. The monolith skipped it entirely on the largest task.

🛡️

B Always Hardens

Explicit security reviews in every round: size limits, input validation, ReDoS checks, path traversal. A’s hardening was implicit or absent.

🎯

B Self-Scores Honestly

B’s self-assessment was off by 0.01–0.04 from independent grading. A was off by 0.10–0.22. A scored itself 0.92 on a zero-test output.

⚡

A Is Faster

A completed tasks in 50–67% of B’s time. Speed was A’s only consistent advantage — and it came at the cost of everything else.


💡 Why Does the Same Content Perform Differently?

The instructions are identical. Word for word. So why does the format matter?

Think of it like a checklist on a clipboard vs a checklist on a poster. The poster has everything visible at once — but you can skip items because nothing forces you to look at each one individually. The clipboard makes you flip to each page and check it off. Same items, but the clipboard creates a natural gate at each step.

When all phases live in one file, Claude can skim. It reads the full protocol, forms a plan, and executes from memory. Phases it considers “less important” for the task — like adversarial testing on a CLI tool — get silently dropped.

When each phase is a separate file, Claude has to explicitly load it. The act of reading scripts/phase-3-test.md puts that phase’s full checklist into working memory right when it’s needed. Harder to skip what you just read.

Companion files work like paged memory. All 14 files exist on disk, but only two are resident at any moment: SKILL.md (the slim orchestrator/router) plus the one phase script currently executing, e.g. scripts/phase-3-build.md at ~12 KB — roughly 38% of context in use. The monolith keeps all 4 files resident at ~86% of context, and phases get skimmed instead of executed.

⚖️ The Tradeoff Is Real

Modular isn’t strictly better. There’s a genuine cost:

| Factor | Monolith (A) | Modular (B) |
| --- | --- | --- |
| Speed | Finishes in 50–67% of B's time | Slower (more file reads) |
| Test coverage | Inconsistent | Always present |
| Security review | Implicit/absent | Explicit every round |
| Feature breadth | More features built | Fewer, better-verified |
| Self-assessment | Off by 0.10–0.22 | Off by 0.01–0.04 |
| Context overhead | ~4 file reads | ~10–14 file reads |

If you need fast iteration and plan to review the output yourself, the monolith’s speed advantage matters. If you need to ship what it produces, the modular version’s verification pays for itself.

🧰 What This Means for Skill Authors

If your skill has phases, steps, or checklists that must all execute — split them into separate files. The file boundary acts as a natural attention gate. Same content, better follow-through.

Four concrete rules from this experiment:

  1. One concern per file. Testing instructions in one file, security review in another. Don’t bundle them.
  2. The orchestrator stays slim. It routes to scripts. It doesn’t contain instructions itself. If the orchestrator has checklists, they’ll get skimmed.
  3. Accept the speed cost. More file reads means slower execution. That’s the price of reliability.
  4. Verify your own skill’s self-scoring. Both skills inflated their scores. Build an independent grader. Trust the grader, not the skill.

See One-Shot Execute a Real Task

Watch every phase, every score, every decision — from prompt to delivery in one live run.

Watch the Live Run · Get Godmode