Post-Mortem June 8, 2026 ⏱️ 7 min read

We Audited Our Own Quality Loop. It Failed Our Best Work.

TL;DR

The audit: We pointed 67 adversarial agents across 12 dimensions at One-Shot Orchestra, our self-healing AI build system. Grade: C+.

The hole: The quality loop had bugs in the one thing it exists to guarantee. The scorer read the wrong field, so every dimension came back blank, flawless work scored 0.50, and the gate failed it.

Two more: A retry shipped the previous attempt's stale files. A throttled worker's empty result was accepted as success, and a whole slice of the build vanished.

The result: Every finding fixed. The offline suite is back to 11 of 11 green, a re-audit confirms the holes are closed, and the self-healing patcher (which had deadlocked itself) heals again.

🎻 What we pointed the audit at

One-Shot Orchestra builds software with a team of AI workers instead of one. A thin "narrator" drives a daemon, and every worker spawns in a fresh, empty context window so nothing leaks between them.

The part under the microscope here is the quality loop. Workers do the job, a scorer grades the result across several dimensions, and a gate decides: ship it, or loop back and try again until the work clears a bar.

The loop is sound in shape. The audit found that the read step in the middle was quietly wrong.

An "audit" here is not one engineer reading code. It is a fan-out of independent agents, each owning one angle, with a second pass that tries to refute every finding before it counts.

67 agents

Each took one slice of the system and reported only what it could prove with a file and a line number.

12 dimensions

Correctness, concurrency, state handling, docs, and nine more. No single reviewer holds all twelve in their head.

Adversarial pass

Every finding was handed to a skeptic told to kill it. Three findings died there as false alarms.

Receipts only

A claim with no reproducible evidence did not make the report. Opinions are not findings.

🎯 The grade was C+

The verdict, in the audit's own words: architecturally solid, but the quality-loop core has correctness holes and the docs had drifted badly. Solid bones, soft center.

Here is the full shape of what came back, sorted by how much each tier could hurt you.

Severity	Count	What it means
Critical	3	Can ship broken or wrong work silently.
High	11	Real bugs, narrower blast radius.
Medium	19	Correctness or clarity issues worth fixing.
Low / Nit	~30	Polish, naming, small inconsistencies.
Doc drift	~13	Docs describing a version that no longer existed.
Refuted	3	Flagged, then killed by the skeptic pass.

The three Critical findings are the story. Each one is a different way the loop could hand you a build that looks graded but is not. Click through them:

💥 Defect one: the grader read the wrong number

This is the one that stings. The workers were doing great work, and the gate was failing it anyway.

Workers write their scores to a top-level score field. The gate read digest.score instead. That field does not exist, so every read came back empty.

Picture a factory inspector reading a gauge that was never plugged in. The needle sits at zero, so every part gets stamped REJECT. The parts are fine. The inspector is staring at the wrong dial.

When every dimension reads as blank, the composite math has nothing to average and collapses to a 0.50 floor. The gate wants 0.92 to ship.

So flawless work scored 0.50, failed the gate, and looped back to be redone, forever. Flip the toggle below between the broken read and the fixed read on the exact same worker output.

The fix was small and blunt: the scorer now reads the top-level score and keeps digest.score only as a fallback, with a warning if a value lands outside the valid range. Same data, right dial.

🔁 Defect two: the retry shipped stale work

When the gate sends work back for another attempt, the loop re-runs the fan-out of workers. The bug: it reused the previous attempt's bookkeeping about which pieces were done.

So a second attempt could look at a stale "done" mark from attempt one and ship those old files, skipping the fresh work it just asked for. The loop meant to improve the build could quietly serve you a worse, older one.

▲ Before

attempt 2 starts

reads node status from attempt 1

node A: done (stale)

node B: done (stale)

ships: attempt 1 files

the redo never happened

▲ After

attempt 2 starts

clears node status first

node A: pending → rebuilt

node B: pending → rebuilt

ships: attempt 2 files

the redo actually runs

A looping retry now resets its own scoreboard before it runs, so "done" always means done this time.

The fix resets the subtree's node statuses at the start of every looped attempt. A retry can no longer inherit a stale "done" from the run before it.

🕳️ Defect three: a ghost worker passed inspection

Big builds split the work across parallel workers, each owning one slice. When a machine runs hot, the watchdog throttles a worker, and it can finish without writing a real result.

The bug accepted that throttled, empty result as a success and merged it. The worker's whole slice of the build silently disappeared from the final output, and nothing flagged it.

It is the opposite mistake from an earlier orphaned-worker post-mortem. That time, real work on disk got scored as a failure. This time, an absent worker's empty station got waved through as done. Same root: trusting a status label over the actual artifact.

Throttled or empty now counts as failed, not done. The slice gets rebuilt instead of vanishing.

The fix makes the merge step status-aware. A throttled or empty worker is classified as failed, which reruns its slice instead of accepting a hole.

🔧 The self-healer that deadlocked itself

One more, because it is the most ironic. The runner can patch its own bugs at runtime, then smoke-test the patch before keeping it. That safety check was eating every patch alive.

The patch process holds a lock while it works. Its own smoke test then launched a child that tried to read status, which waited on that same lock, which never freed. The check timed out, so the runner assumed the patch was bad and rolled it back. Every time.

🔧 Patch holds the run lock
↓
🔍 Smoke test spawns a child to read status
↓
🛑 Child waits on the lock the patch still holds
↓
⏱️ Timeout: exit 17
↓
♻️ Runner assumes the patch is broken and rolls it back

The fix gives the smoke-test child a read-only mode that skips the lock entirely. The check can now inspect the patched runner without fighting the patch for the same key.

A self-healing system has to survive its own cure. The patch logic was correct. The thing meant to verify it was standing on the patch's air hose.

🧪 Then we re-ran everything

A fix you have not re-checked is a hope, not a fix. So after every change went in, we ran the whole audit and test cycle again from a clean read.

What the re-audit confirmed: we independently re-read the current code for all 3 Critical and all 11 High findings. Every fix was present and complete. None were missing, none half-done.

Re-check	Result
3 Critical findings, re-read in current code	all 3 fixed
11 High findings, re-read in current code	all 11 fixed
Offline test suite	11 / 11 green
Self-heal deadlock, live reproduction	4 / 4 assertions pass
Doc drift (versions, caps, step numbers)	reconciled

The self-heal fix got the hardest check. We held the lock with a live process, then proved the behavior end to end.

🔒 Hold the run lock with a live worker
↓
🛑 Plain status check → exits 17 (correctly blocked)
↓
✅ Read-only status check → runs clean, reports delivered
↓
🔐 The held lock is never touched

Do

Re-read the live code for every fix, then re-run the suite. Confirm the green, do not assume it.

Don't

Trust that a patch landed because you wrote it. The audit found three "done" things that were not done.

💡 The lesson under all four

Strip the details and every defect is the same shape. The code trusted a status field instead of the real thing it was supposed to measure.

A blank dimension read became a 0.50. A stale "done" became a shipped file. A throttled stub became a success. A held lock became a "patch is broken." Convenient signals, all lying about the artifact underneath.

A quality loop is only as honest as the field it reads. If the number it grades is not the number the work produced, the loop is theater. We rebuilt ours to read the real one. This is the same instinct behind deleting the judge and teaching the conductor to read every score.

Read the source

Grade the field the worker actually wrote, not the one the schema wishes it wrote.

Reset before retry

A loop that reuses old state will hand you old work. Clear the board each pass.

Empty is not done

A worker that produced nothing has not succeeded. Make absence a failure, not a pass.

Run a loop that grades the real number

One-Shot Scripts ships the daemon-driven, quality-gated orchestra protocol from this post, audited and re-audited. Workers run in fresh context windows and the gate reads what the work produced.

See One-Shot Pricing

← We Threw Out 132,000 Skills All posts →