Built by /blog-post-GM — a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.

Get free skill (account)

Experiment April 2026 ⏱️ 7 min read

We Split One Skill Into 14 Files. Then We Raced Them.

TL;DR

🧪 The experiment: One-Shot Beta as one big file vs decomposed into 14 modular scripts
🏁 The result: Modular won 3-0 across small, medium, and large tasks
💥 The surprise: The monolith never wrote unit tests. The modular version always did.

MONOLITH vs MODULAR · HEAD-TO-HEAD Click a score bar → round breakdown

Composite score · 0–100

Modular (14 files)

92 / 100

R1 SMALL

R2 MED

R3 LARGE

Monolith (1 file)

71 / 100

R1 SMALL

R2 MED

R3 LARGE

Context budget used · less is better

Modular — ctx

38%

Monolith — ctx

86%

MODULAR WINS 3/3 ROUNDS · LOWER CONTEXT, HIGHER SCORE

🤔 The Question Nobody Could Answer

One-Shot Beta is a 400-line execution protocol spread across 4 files. It works. It ships code that passes its own rubric 92% of the time.

But what if each phase lived in its own file? Would the AI follow instructions better when they arrive one script at a time — or worse?

Think of it like this: One approach is a textbook — everything in chapters, read front to back. The other is a binder of laminated checklists — pull the one you need, follow it, put it back. Same information, different delivery. Does the format change how well the student performs?

We had strong opinions on both sides. So we stopped arguing and ran the experiment.

⚙️ The Setup

Contestant A: The Monolith

/one-shot-beta — the production skill. Four files: SKILL.md (the orchestrator), execution-protocol.md, scoring-and-assessment.md, and delivery-format.md. All phases live together in one file.

Contestant B: The Modular Version

/one-shot-scripts — same protocol, decomposed into 14 standalone scripts. One file per phase, loaded on demand. The orchestrator is a slim routing document: “Read scripts/phase-2-build.md and execute.”

Aspect	Monolith (A)	Modular (B)
Total files	4	15
Phase loading	All in one file	One file per phase
File reads per run	~4	~10–14
Content identical?	Yes — word for word

🔬 The Protocol (and the Round We Threw Out)

We actually ran 4 rounds. The first one got scrapped.

During Round 1’s second session, one of the skills found the safe-parse directory already written by the first session’s run — and used it. Instead of building from scratch, it read the existing code, “improved” it, and claimed credit. A contaminated test.

Lesson learned: A/B testing AI skills requires true isolation. If both sessions write to the same directory, the second one isn’t starting from zero — it’s getting a head start. We scrapped the round, cleaned the directory, and re-ran with separate output paths.

Rounds 1–3 below used isolated directories: safe-parse-a-monolith/ and safe-parse-b-modular/. No cross-contamination.

Each valid round followed the same process:

📋 Same prompt for both skills
↓
🆕 Fresh session → run /one-shot-beta (A)
↓
🆕 Fresh session → run /one-shot-scripts (B)
↓
🆕 Fresh session → run /ab-grader on both outputs

Control variables: same prompt (word for word), fresh sessions (no prior context), same model (Opus 4.6), same rubric (8 dimensions, 0.03 significance threshold).

3-ROUND REPLAY · PICK A SIZE

Round 1 · Small — Build a Node.js safe-parse module (3 functions, full test suite, TS defs).

Modular (B) · Winner

0.97

75 tests · security 1.00 · 1MB JSON cap · ParseError class

Monolith (A)

0.82

36 tests · security 0.70 · no size limit · no validation

B's lead

+0.15 composite · 2× tests

1️⃣ Round 1: Small — safe-parse Utility

The task: Build a Node.js module with 3 parse functions (safeJsonParse, safeIntParse, safeDateParse), a full test suite, and TypeScript definitions. No dependencies.

Dimension	Monolith (A)	Modular (B)	Edge
Code Quality	0.85	1.00	B
Test Coverage	0.85	1.00	B
Security	0.70	1.00	B
Completeness	0.85	1.00	B
Process	0.85	0.85	=
Documentation	0.70	0.85	B
Polish	0.85	1.00	B
Decisions	0.85	1.00	B
Composite	0.82	0.97	B wins (+0.15)

The gap: B produced 75 tests vs A’s 36. B had a 1MB JSON size limit, epoch range validation, structured ParseError with source tracking, and readonly on all result fields. A had none of that — including no size limit on a library called “safe-parse.”

Round 1 verdict: B wins. +0.15 margin driven by security (+0.30 gap) and double the test count. Not a coin flip — structural differences.

2️⃣ Round 2: Medium — Rate Limiter Middleware

The task: Express middleware with fixed window and sliding window algorithms, per-IP and per-API-key limiting, 429 responses with Retry-After headers, in-memory store, and concurrent request tests.

Dimension	Monolith (A)	Modular (B)	Edge
Code Quality	0.85	0.90	B
Test Coverage	0.70	0.90	B
Security	0.70	0.85	B
Completeness	0.85	0.85	=
Process	0.85	0.95	B
Documentation	0.85	0.85	=
Polish	0.85	0.85	=
Decisions	0.85	0.95	B
Composite	0.80	0.89	B wins (+0.09)

A’s edge: 67% faster (4m37s vs 7m44s) and upgraded its mutex from spin-wait to FIFO queue during hardening. Genuine runtime improvement.

B’s edge: 46 tests vs A’s 23. Constructor validation with RangeError (A silently accepted capacity=0). Dedicated adversarial test suite with null bytes, XSS payloads, and special characters. A had zero adversarial tests.

New pattern emerging: Both skills self-scored ~0.10 above independent assessment. A self-scored 0.93 while actually scoring 0.80. B self-scored 0.96 while actually scoring 0.89. Neither is honest — but B is closer.

3️⃣ Round 3: Large — Markdown Link Checker CLI

The task: CLI tool that validates markdown files for broken links, checks image references, validates YAML frontmatter, supports glob patterns, has a --fix mode, and outputs in multiple formats. Full test suite with fixtures.

Dimension	Monolith (A)	Modular (B)	Edge
Code Quality	0.85	0.92	B
Test Coverage	0.25	0.95	B
Security	0.70	0.85	B
Completeness	0.92	0.85	A
Process	0.70	0.95	B
Documentation	0.92	0.85	A
Polish	0.85	0.90	B
Decisions	0.70	0.92	B
Composite	0.70	0.91	B wins (+0.21)

This round broke it open. Output A had zero unit tests. No test framework. No test files. No test script in package.json. Only manual CLI verification against fixtures. It self-scored testing at 0.88, claiming this was “acceptable for a CLI tool of this scope.”

Output B produced 81 tests across 6 suites — parser, checker, fixer, scanner, CLI args, and config loading — plus adversarial inputs for null bytes, deeply nested brackets, and 10,000-character URLs. All passing.

A built more features

3 fix modes (unlink/comment/remove), --json output, --init config generation, bare URL detection, rate-limit backoff. The most complete feature set of any round.

A couldn’t verify they worked

Zero automated tests. Self-scored 0.92 composite on a 0-test output. Actual composite: 0.70. A gap of +0.22 between self-assessment and reality.

📊 The Cumulative Scorecard

CUMULATIVE: Monolith 0 — 3 Modular

Round 1 (Small):  B wins +0.15  (0.82 vs 0.97)
Round 2 (Medium): B wins +0.09  (0.80 vs 0.89)
Round 3 (Large):  B wins +0.21  (0.70 vs 0.91)

Four patterns held across every round:

🧪

B Always Tests

75, 46, and 81 tests across 3 rounds. A produced 36, 23, and 0. The modular skill never skipped testing. The monolith skipped it entirely on the largest task.

🛡️

B Always Hardens

Explicit security reviews in every round: size limits, input validation, ReDoS checks, path traversal. A’s hardening was implicit or absent.

🎯

B Self-Scores Honestly

B’s self-assessment was off by 0.01–0.04 from independent grading. A was off by 0.10–0.22. A scored itself 0.92 on a zero-test output.

⚡

A Is Faster

A completed tasks in 50–67% of B’s time. Speed was A’s only consistent advantage — and it came at the cost of everything else.

FOUR PATTERNS · INTERACTIVE Hover or click for round-by-round evidence

B Always Tests

B wrote 75 · 46 · 81 tests across the 3 rounds; A wrote 36 · 23 · 0. Modular never skipped tests — even on the largest task.

B Always Hardens

Explicit security reviews every round: size limits, input validation, ReDoS checks, path traversal. A's hardening was implicit or absent.

B Self-Scores Honestly

B's self-grade was off by ±0.01–0.04 from the independent grader. A was off by ±0.10–0.22 — once gave itself 0.92 on a zero-test output.

A Is Faster

A finished in 50–67% of B's time — the only consistent edge it had. Speed came at the cost of tests, hardening, and honest scoring.

3 patterns favour modular · 1 favours monolith · speed costs every other dimension

💡 Why Does the Same Content Perform Differently?

The instructions are identical. Word for word. So why does the format matter?

Think of it like a checklist on a clipboard vs a checklist on a poster. The poster has everything visible at once — but you can skip items because nothing forces you to look at each one individually. The clipboard makes you flip to each page and check it off. Same items, but the clipboard creates a natural gate at each step.

When all phases live in one file, Claude can skim. It reads the full protocol, forms a plan, and executes from memory. Phases it considers “less important” for the task — like adversarial testing on a CLI tool — get silently dropped.

When each phase is a separate file, Claude has to explicitly load it. The act of reading scripts/phase-3-test.md puts that phase’s full checklist into working memory right when it’s needed. Harder to skip what you just read.

COMPANION FILES = PAGED MEMORY 14 files exist. Only the orchestrator + one phase script live in working memory at any time.

⊞

SKILL.md

orchestrator · slim router · resident

LOAD

▤

scripts/phase-3-build.md

Build — execute the spec · 12 KB

14files available

2in working set

38%context used

Monolith: all 4 files always resident · 86% context · phases skim instead of execute

⚖️ The Tradeoff Is Real

Modular isn’t strictly better. There’s a genuine cost:

Factor	Monolith	Modular
Speed	50–67% faster	Slower (more file reads)
Test coverage	Inconsistent	Always present
Security review	Implicit/absent	Explicit every round
Feature breadth	More features built	Fewer, better-verified
Self-assessment	Off by 0.10–0.22	Off by 0.01–0.04
Context overhead	~4 file reads	~10–14 file reads

If you need fast iteration and plan to review the output yourself, the monolith’s speed advantage matters. If you need to ship what it produces, the modular version’s verification pays for itself.

🧰 What This Means for Skill Authors

If your skill has phases, steps, or checklists that must all execute — split them into separate files. The file boundary acts as a natural attention gate. Same content, better follow-through.

Four concrete rules from this experiment:

One concern per file. Testing instructions in one file, security review in another. Don’t bundle them.
The orchestrator stays slim. It routes to scripts. It doesn’t contain instructions itself. If the orchestrator has checklists, they’ll get skimmed.
Accept the speed cost. More file reads means slower execution. That’s the price of reliability.
Verify your own skill’s self-scoring. Both skills inflated their scores. Build an independent grader. Trust the grader, not the skill.

See One-Shot Execute a Real Task

Watch every phase, every score, every decision — from prompt to delivery in one live run.

Watch the Live Run Get Godmode

← We Failed 6 Times Fixing One Bug Vanilla vs Skills Showdown →