The Eight-Hour Silence: When Orchestra Stopped Spawning
🔇 The symptom: An Orchestra run spawned workers cleanly for eight hours, then every subsequent fan-out silently failed.
🔍 The forensics: Briefs written, but the shell scripts that launch the workers never appeared on disk. No process, no result file, no error.
🐛 The cause: Four bugs in one trench coat, each layered over the next so the symptom looked like nothing at all.
✅ The fix: Make the spawn loop transactional, defer instead of die under load, lock out concurrent narrators, and trap shell-level failures with a loud sentinel file.
Build Stats
| Metric | Value |
|---|---|
| Skill | one-shot-orchestra |
| Version | 0.25.2 → 0.26.0 (runner 1.11.2 → 1.12.0) |
| Files changed | 12 (3 new, 9 modified) |
| Lines | +1,031 / −48 |
| Verification tests | 8 (run-lock, spawn trap, throttle bypass, concurrent-narrator block, transactional fan-out, JS syntax, shell syntax) |
| Commit | aa32365d |
The Eight-Hour Silence
Picture a long run on Orchestra: a migration job, eleven phases, three or four parallel workers per phase. For eight hours every spawn is clean.
Diagnose, Recon, Research times three, Planner, Builder times four, Merger. The chat log fills with green checkmarks.
Then the next phase starts, and nothing happens.
No claude.exe in Task Manager. No worker windows opening.
The narrator session keeps polling and waiting, polling and waiting. Eventually it gives up and declares the run partial-unverified.
The user looks at the chat log and asks the obvious question: where are the workers?
Core insight: The worst kind of automation failure is the one that looks like a long task. If a system can wait silently forever, sooner or later it will, and the user will only notice when they check back hours later.
Forensics on Disk
Orchestra leaves a paper trail. Every worker spawn is supposed to write three files in order: a brief-NAME.md (the prompt), a .prompt-NAME.txt (the wrapped command), and a .run-NAME.sh (the shell script that actually launches the worker process).
For the workers that vanished, only the brief existed. The other two files, and the worker process itself, never appeared.
    brief-Test-1.md      ✓ written
            ↓
    .prompt-Test-1.txt   ✗ missing
            ↓
    .run-Test-1.sh       ✗ missing
            ↓
    No claude.exe process. No result file. No error. No log line. Nothing.
That gap, between "brief written" and the two files that should have followed it, was where the spawn pipeline died. Whatever broke, it broke quietly enough that not even the watchdog noticed.
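The same forensic check can be scripted. The sketch below assumes the three file-name patterns described above; the function name `spawnArtifacts` and its return shape are illustrative, not part of Orchestra.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Reports which spawn artifacts exist for one worker and where the
// pipeline stopped. File-name patterns come from the article; the
// function name and return shape are assumptions for illustration.
function spawnArtifacts(runDir, name) {
  const files = {
    brief: `brief-${name}.md`,
    prompt: `.prompt-${name}.txt`,
    script: `.run-${name}.sh`,
  };
  const present = {};
  for (const [stage, file] of Object.entries(files)) {
    present[stage] = fs.existsSync(path.join(runDir, file));
  }
  // Files are written in order, so the first missing stage marks the break.
  const stoppedAt = Object.keys(files).find((stage) => !present[stage]) ?? null;
  return { ...present, stoppedAt };
}
```

For one of the vanished workers, a call such as `spawnArtifacts(runDir, 'Test-1')` would report `brief: true` and `stoppedAt: 'prompt'`.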
Four Bugs in One Trench Coat
The investigation turned up four separate bugs, each one masking the others.
Fix #1: A Loop That Survives Mid-Step Failure
The fan-out spawn loop iterated through N workers. If the third one threw, the loop aborted, workers four and five never got a chance, and the state file was never written.
The fix is simple in shape: each iteration runs in its own try/catch. Failures fall into three buckets (spawned, deferred, failed), and the state always gets written at the end with all three lists.
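    const spawned = [], deferred = [], failed = [];

    // for each worker N in the fan-out batch
    // (`batch` is an illustrative name for the list of workers)
    for (const { name, briefPath } of batch) {
        let result;
        try {
            result = spawnWorker({ runDir, name, briefPath });
        } catch (err) {
            // Hard failure: record it and move on to the next worker.
            failed.push({ name, reason: err.message });
            continue;
        }
        if (result?.deferred) {
            // Throttled: retried on a later spawn call.
            deferred.push({ name, reason: result.reason });
            continue;
        }
        spawned.push(name);
    }
    // The state is written here, after the loop, with all three lists.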
One bad worker no longer takes the others down with it. The narrator gets a complete picture: who launched, who got pushed back, who failed outright.
Fix #2: Defer, Don't Die
Orchestra has a pre-flight throttle that checks CPU, RAM, claude.exe count, disk queue, and a few other signals. When the host is under pressure, the throttle blocks the spawn so the machine doesn't tip over.
The old behaviour: throw an exception. Combined with bug one, that exception killed the whole batch.
The new behaviour: write a throttled-result file, return { deferred: true, reason }, and let the caller decide. The caller (now transactional from fix one) drops the worker into the deferred bucket, and the narrator's next orchestra spawn call picks up the deferred workers once the host clears.
Rule of thumb: Transient pressure should produce a deferred state, not an exception. An exception thrown mid-batch unwinds past every worker still queued behind it, whereas a flag that says "try me again later" leaves the decision with the caller.
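A minimal sketch of the defer-instead-of-throw shape follows. The thresholds, the `signals` object, the `launch` callback, and the `throttled-NAME.json` file name are all assumptions made for illustration; the real throttle checks more signals than these.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical limits, chosen only to make the example concrete.
const LIMITS = { maxCpuPercent: 90, minFreeRamMb: 2048, maxClaudeProcesses: 8 };

// Returns null when the host is healthy, or a readable reason when it is not.
function throttleReason(signals, limits = LIMITS) {
  if (signals.cpuPercent > limits.maxCpuPercent) return `cpu at ${signals.cpuPercent}%`;
  if (signals.freeRamMb < limits.minFreeRamMb) return `free RAM at ${signals.freeRamMb} MB`;
  if (signals.claudeProcesses >= limits.maxClaudeProcesses) {
    return `${signals.claudeProcesses} claude.exe processes running`;
  }
  return null;
}

function spawnWorker({ runDir, name, signals, launch }) {
  const reason = throttleReason(signals);
  if (reason) {
    // Leave a record on disk, then return a flag instead of throwing.
    const file = path.join(runDir, `throttled-${name}.json`); // hypothetical name
    fs.writeFileSync(file, JSON.stringify({ name, reason, at: new Date().toISOString() }));
    return { deferred: true, reason };
  }
  launch(name); // hypothetical launcher callback
  return { deferred: false };
}
```

Because the throttle path returns rather than throws, the only exceptions the fan-out loop still sees are genuine launcher failures.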
Fix #3: One Conductor at a Time
The May 3 chat log confirmed the third bug: when the original narrator hit a context limit, a handover spawned a second narrator session. Both were calling orchestra spawn on the same run-id, racing on the briefs directory, the state file, and the per-worker shell scripts.
Two conductors, one orchestra, predictable result. The fix is a per-run lock at <run-dir>/.narrator.lock: an atomic mkdir with the holder's PID inside it. Every mutating subcommand acquires the lock, runs, and releases.
Do: Use an atomic mkdir as the lock primitive. mkdir is atomic on Windows and POSIX. Stale-PID detection via process.kill(pid, 0) handles the case where a holder died abnormally.
Don't: Use a plain file with read-modify-write. There is always a window where two writers see no lock and both create one.
If a second narrator hits a locked run, it gets a clear envelope back: narrator_already_active, the holder's PID, and a hint to run orchestra unlock only if you're sure the holder is gone.
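The lock described above can be sketched as follows. The `.narrator.lock` directory, the `narrator_already_active` code, and the `process.kill(pid, 0)` check come from the text; the `pid` file name inside the lock directory, the function names, and the `stale` field are assumptions. This sketch only reports a stale holder and leaves reclaiming to the caller.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// True if a process with this PID exists. Signal 0 performs the existence
// check without delivering a signal.
function isAlive(pid) {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    return err.code === 'EPERM'; // exists, but we may not signal it
  }
}

function acquireNarratorLock(runDir) {
  const lockDir = path.join(runDir, '.narrator.lock');
  const pidFile = path.join(lockDir, 'pid'); // hypothetical file name
  try {
    fs.mkdirSync(lockDir); // atomic: exactly one caller succeeds
  } catch (err) {
    if (err.code !== 'EEXIST') throw err;
    let holderPid = null;
    try {
      holderPid = Number.parseInt(fs.readFileSync(pidFile, 'utf8'), 10);
    } catch {
      // The holder has created the directory but not written its PID yet.
    }
    const known = Number.isInteger(holderPid);
    return {
      ok: false,
      error: 'narrator_already_active',
      holderPid: known ? holderPid : null,
      stale: known && !isAlive(holderPid), // holder died without releasing
    };
  }
  fs.writeFileSync(pidFile, String(process.pid));
  return { ok: true, lockDir };
}

function releaseNarratorLock(runDir) {
  fs.rmSync(path.join(runDir, '.narrator.lock'), { recursive: true, force: true });
}
```

A mutating subcommand would call `acquireNarratorLock`, do its work inside a `try`, and call `releaseNarratorLock` in the `finally` so the lock is dropped even when the work throws.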
Fix #4: Fail Loud or Don't Fail At All
The bash script that launches each worker had set -euo pipefail at the top, so any failure inside it ended the script on the spot. The exit was silent because the JS caller spawned bash with stdio: 'ignore' and detached: true, then called child.unref(), which sent both the output and the exit code into the void.
Now the script installs a trap on ERR and EXIT. If anything inside the launcher fails before the worker actually launches, the trap fires and writes two files: a .spawn-NAME.error sentinel with the line number and the failing command, plus a loud result-NAME.json with status "failed" so the existing detection path picks it up immediately.
Status reads now check for the sentinel at the top of every poll. A spawn-time failure that used to take ten minutes (the heartbeat-timeout window) to surface now surfaces in seconds.
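On the reading side, the sentinel-first check might look like the sketch below. The `.spawn-NAME.error` and `result-NAME.json` names come from the text; the function name, the `phase` and `detail` fields, and the `running` fallback are assumptions.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Reads one worker's status, checking the spawn sentinel before anything
// else so a launcher failure surfaces on the very next poll.
function readWorkerStatus(runDir, name) {
  const sentinel = path.join(runDir, `.spawn-${name}.error`);
  if (fs.existsSync(sentinel)) {
    return {
      name,
      status: 'failed',
      phase: 'spawn', // illustrative field
      detail: fs.readFileSync(sentinel, 'utf8').trim(),
    };
  }
  const resultFile = path.join(runDir, `result-${name}.json`);
  if (fs.existsSync(resultFile)) {
    return { name, ...JSON.parse(fs.readFileSync(resultFile, 'utf8')) };
  }
  // Neither file exists yet: the worker is presumed to be still running.
  return { name, status: 'running' };
}
```

Checking the sentinel first means the poll does not have to wait out the heartbeat-timeout window before reporting a worker that never launched.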
Before vs After
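Replay a five-worker fan-out under realistic stress: worker three hits the throttle, worker four hits a hard error. Under the old behaviour, worker three's throttle exception aborts the loop: workers one and two are already running, workers four and five never get a chance, and the state file is never written. Under the new behaviour, workers one, two, and five spawn, worker three lands in the deferred bucket, worker four lands in the failed bucket, and the state file records all three lists.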
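The five-worker scenario can also be replayed in a few lines of JavaScript. This is a simulation under stated assumptions, not Orchestra's code: the worker names `W1` to `W5`, the stub launcher, and both loop functions are illustrative.

```javascript
// Stub launcher for the replay: W3 is throttled, W4 hits a hard error.
function makeSpawnWorker({ throttleThrows }) {
  return function spawnWorker({ name }) {
    if (name === 'W3') {
      if (throttleThrows) throw new Error('host under pressure'); // old throttle
      return { deferred: true, reason: 'host under pressure' }; // new throttle
    }
    if (name === 'W4') throw new Error('launcher failed');
    return { deferred: false };
  };
}

const names = ['W1', 'W2', 'W3', 'W4', 'W5'];

// Old loop: no per-worker try/catch, so the first throw aborts the batch
// before the state is returned.
function oldFanOut(spawnWorker) {
  const spawned = [];
  for (const name of names) {
    spawnWorker({ name });
    spawned.push(name);
  }
  return { spawned, deferred: [], failed: [] };
}

// New loop: every worker gets its own try/catch and lands in one bucket.
function newFanOut(spawnWorker) {
  const spawned = [], deferred = [], failed = [];
  for (const name of names) {
    let result;
    try {
      result = spawnWorker({ name });
    } catch (err) {
      failed.push({ name, reason: err.message });
      continue;
    }
    if (result?.deferred) {
      deferred.push({ name, reason: result.reason });
      continue;
    }
    spawned.push(name);
  }
  return { spawned, deferred, failed };
}

let oldState = null;
try {
  oldState = oldFanOut(makeSpawnWorker({ throttleThrows: true }));
} catch (err) {
  console.log('old: batch aborted, no state written:', err.message);
}

const newState = newFanOut(makeSpawnWorker({ throttleThrows: false }));
console.log('new:', JSON.stringify(newState));
```

In the old run, `oldState` stays `null` because the throw from W3 escapes the loop. In the new run, W1, W2, and W5 are spawned, W3 is deferred, and W4 is recorded as failed.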
The Lesson
Think of a factory production line. Each station has a green light when it's working and a red one when it's stopped, so the floor manager can spot a stalled station from across the building.
What broke on May 3 was the equivalent of a station whose worker walked off shift mid-task, but the green light stayed on. The line just kept feeding work into the dark, and the manager only noticed eight hours later when the warehouse next door called to ask why the conveyor was empty.
The takeaway: Reliable automation needs three things in concert. Surface errors fast (so the green light goes red the moment the worker leaves), recover gracefully from transient pressure (so the line pauses without crashing), and prevent two operators from racing the same job (so two managers can't both think they're running the floor).
The fix shipped as Orchestra 0.26.0 and is live in Godmode and the Ultimate Bundle. Existing runs benefit on the next orchestra spawn call, so if you're driving a long Orchestra run right now, you already have it.
Honest trade-off: the run-lock assumes the holder narrator is reachable via PID. On the same machine that's reliable, but a future shared-filesystem deployment would need a heartbeat file with a TTL on top of the mkdir lock.
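One possible shape for that heartbeat, sketched under assumptions: the `heartbeat` file name, the function names, and the TTL handling are hypothetical, and the comparison presumes the machines sharing the filesystem have reasonably synchronized clocks.

```javascript
const fs = require('node:fs');
const path = require('node:path');

const HEARTBEAT = 'heartbeat'; // hypothetical file inside the lock directory

// Holder side: call on an interval comfortably shorter than the TTL.
function touchHeartbeat(lockDir, now = Date.now()) {
  fs.writeFileSync(path.join(lockDir, HEARTBEAT), String(now));
}

// Contender side: the lock counts as live only while the heartbeat is recent.
function isLockFresh(lockDir, ttlMs, now = Date.now()) {
  try {
    const last = Number(fs.readFileSync(path.join(lockDir, HEARTBEAT), 'utf8'));
    return Number.isFinite(last) && now - last <= ttlMs;
  } catch {
    return false; // no heartbeat written yet, or the file is unreadable
  }
}
```

A contender on another machine cannot probe the holder's PID, so it would treat the lock as abandoned once `isLockFresh` returns false.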