// independent blind review

June 13, 2026 · double-blind A/B · published verbatim

The showcase scores are graded by the skills' own dimensional scoring protocol. A fair objection is that a tool grading its own output is not independent, so we ran a blind panel. For each of the 14 build pairs, we stripped every tier label from both implementations, assigned them to slots A and B pseudo-randomly but reproducibly (by sha256("gm-independent-review-2026-06-13:" + slug)), then handed the two anonymous code folders and screenshots to 3 fresh Claude agents with no knowledge of which tool produced either side, which products exist, or that this page would be published. Each judge read all the code and scored both sides on six dimensions: code, testing, security, errors, completeness, and UX. Presentation order was alternated across each pair's judges to control for order bias. That gives 14 × 3 = 42 independent verdicts in total. Every number below is reported exactly as the judges returned it, wins and losses alike.
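A minimal sketch of how a hash-based slot assignment like this can work. The seed prefix is the one quoted above; the function name `assign_slots` and the rule that turns the digest into a slot order (parity of the first byte) are assumptions for illustration, so this sketch is not guaranteed to reproduce the assignments listed for each pair.

```python
import hashlib
from typing import Dict

SEED_PREFIX = "gm-independent-review-2026-06-13:"  # seed string quoted in the text


def assign_slots(slug: str) -> Dict[str, str]:
    """Return a deterministic {"A": ..., "B": ...} assignment for one build pair."""
    digest = hashlib.sha256((SEED_PREFIX + slug).encode("utf-8")).digest()
    # ASSUMPTION: the parity of the first digest byte decides the order.
    # The actual mapping from hash to slot is not documented on this page.
    if digest[0] % 2 == 0:
        return {"A": "godmode", "B": "vanilla"}
    return {"A": "vanilla", "B": "godmode"}


# The same slug always yields the same assignment, so anyone can re-derive it.
print(assign_slots("3d-chess") == assign_slots("3d-chess"))
```

The point of seeding with a fixed string plus the slug is that the assignment cannot be chosen after seeing the results, and a reader can recompute it.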

Result

Godmode wins: 14

Vanilla wins: 0

Ties: 0

Mean composite Δ: +0.13

Across 14 pairs and 42 blind verdicts, the panel preferred Godmode in 14 pairs, Vanilla in 0, and called 0 a tie. All 14 pairs were unanimous among their judges. Mean judge confidence was 0.85.

Mean score by dimension

All 42 verdicts	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.89	0.37	0.85	0.82	0.92	0.89	0.79
Vanilla	0.77	0.25	0.80	0.66	0.77	0.71	0.66

Composite = unweighted mean of the six dimensions. Scores 0.00–1.00.
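As a worked check of that definition, the sketch below recomputes the composite from the six per-dimension means in the "All 42 verdicts" table. The helper name `composite` is ours. Because the published per-dimension means are already rounded to two decimals, recomputing from them can differ from a published composite in the last digit for some pairs.

```python
from statistics import mean

DIMENSIONS = ("Code", "Testing", "Security", "Errors", "Complete", "UX")

# Per-dimension means copied from the "All 42 verdicts" table.
godmode = dict(zip(DIMENSIONS, (0.89, 0.37, 0.85, 0.82, 0.92, 0.89)))
vanilla = dict(zip(DIMENSIONS, (0.77, 0.25, 0.80, 0.66, 0.77, 0.71)))


def composite(scores):
    """Unweighted mean of the six dimension scores, rounded to two decimals."""
    return round(mean(scores[d] for d in DIMENSIONS), 2)


print(composite(godmode))                                 # 0.79
print(composite(vanilla))                                 # 0.66
print(round(composite(godmode) - composite(vanilla), 2))  # 0.13
```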

Every pair

3d-chess

Panel verdict: Godmode (unanimous) | Δ +0.21

Brief: Make an advanced 3D chess game

Blind assignment this run: A = godmode, B = vanilla. Compare both yourself: /showcase/3d-chess.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.84	0.18	0.81	0.71	0.91	0.90	0.72
Vanilla	0.73	0.18	0.80	0.67	0.45	0.25	0.51

Per-judge rationale

Judge 1 — picked Godmode (conf 0.90; Godmode 0.72 / Vanilla 0.49)
A ships a complete, polished, correctly-rendering 3D chess game (A.png shows distinct LatheGeometry pieces, gold trim, shadows, highlights, captured trays, move list) with clean module separation (main/pieces/ai.js + chess.js for rules) and a rich feature set: animations, promotion/game-over modals, sound, keyboard shortcuts, 3 modes, 4 AI depths. B is more ambitious on paper (a competent hand-rolled engine with castling/en-passant/SAN and quiescence search) but the shipped artifact is visually broken: B.png shows flat dark tiles with no recognizable 3D pieces standing on the board, so the core "3D chess" deliverable does not render. B also has a real flip bug (pieces are added to the scene, not boardGroup, so rotating boardGroup leaves them behind). Since completeness and ux are weighted highest, B's non-rendering pieces are disqualifying; A wins decisively.
Judge 2 — picked Godmode (conf 0.95; Godmode 0.73 / Vanilla 0.54)
Both are sophisticated, but the brief's core deliverable is a playable, visible 3D board. A renders flawlessly (lathe-turned pieces, gold trim, shadows, full UI) and ships a complete feature set (legal moves via chess.js, minimax+alpha-beta AI with PST/mobility/pawn-structure, promotion, castling/en-passant animation, captured material, move history, undo, flip, sound, check highlights) — confirmed via a fresh headless render matching the provided screenshot. B has a deeper from-scratch engine (custom move-gen, make/unmake, quiescence search, SAN disambiguation, full draw rules) but its hand-rolled mergeGeometries (B/index.html lines 1010-1058) produces garbled, sheared geometry: the board is distorted and no pieces are visible, making the game literally unplayable — a disqualifying completeness/UX defect confirmed by re-rendering. Since the brief weights completeness and ux highest, A wins decisively.
Judge 3 — picked Godmode (conf 0.92; Godmode 0.72 / Vanilla 0.51)
A renders correctly and completely: the screenshot shows proper lathed 3D pieces, gold trim, shadows, and a full starting position, backed by clean modular code (main.js/pieces.js/ai.js) that delegates all rules to the proven chess.js library. B's screenshot is a disqualifying defect for a "3D chess game" — the board renders as dark, sheared, torn geometry with no recognizable pieces, caused by its hand-rolled mergeGeometries() function (B/index.html lines 1010-1058) producing degenerate meshes. B's underlying from-scratch engine (alpha-beta + quiescence, full legal-move gen, draw detection) is genuinely strong, but the artifact does not visually work, and the brief's headline requirement is the 3D rendering. Neither ships tests; A's reliance on a battle-tested library makes it markedly more robust than B's all-bespoke approach that already broke in the renderer.
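The panel line for a pair can be re-derived from the three rationale headers. The sketch below uses the 3d-chess numbers above and assumes that Δ is the difference between the two mean per-judge composites; the page does not spell out that formula, and `panel_summary` is a hypothetical helper name.

```python
from statistics import mean

# (pick, Godmode composite, Vanilla composite) per judge, from the 3d-chess headers.
verdicts = [
    ("Godmode", 0.72, 0.49),
    ("Godmode", 0.73, 0.54),
    ("Godmode", 0.72, 0.51),
]


def panel_summary(verdicts):
    """Summarize one pair: were the picks unanimous, and what is the mean gap?"""
    picks = {pick for pick, _, _ in verdicts}
    # ASSUMPTION: delta = mean Godmode composite minus mean Vanilla composite.
    delta = mean(g for _, g, _ in verdicts) - mean(v for _, _, v in verdicts)
    return {"unanimous": len(picks) == 1, "delta": round(delta, 2)}


print(panel_summary(verdicts))  # {'unanimous': True, 'delta': 0.21}
```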

boil-egg

Panel verdict: Godmode (unanimous) | Δ +0.10

Brief: Make an instructional animation on how to boil an egg

Blind assignment this run: A = vanilla, B = godmode. Compare both yourself: /showcase/boil-egg.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.90	0.28	0.84	0.81	0.93	0.92	0.78
Vanilla	0.77	0.23	0.84	0.62	0.79	0.81	0.68

Per-judge rationale

Judge 1 — picked Godmode (conf 0.85; Godmode 0.77 / Vanilla 0.67)
Both are complete single-file SVG instructional animations with doneness presets, step navigation, timers and keyboard support, and both are XSS-safe (textContent for dynamic copy, innerHTML only on author-controlled static strings). B wins on the two weighted dimensions: it implements a richer, more pedagogically correct method (8 steps including the salt step and the lid-on/off-heat resting technique), 4 doneness levels with a genuinely informative cross-section that color-tweens the yolk and reveals a green overcooked ring for hard eggs (DONENESS table + applyDonenessVisuals, B/index.html:1124-1236), plus a countdown timer, speed control, and prefers-reduced-motion support. B's animation core is also markedly more maintainable: one rAF delta-time tick loop driving a declarative CSS state machine via data-* attributes, versus A's imperative timeout/interval juggling where the timeouts array mixes raw IDs and {clear} objects and clearTimeouts() (A/index.html:558) only handles one shape, risking leaked intervals on some reset paths. A is solid and clean (tidy IIFE, no global leakage) but offers fewer steps, no reduced-motion, and a less informative cross-section. B's only real cost is an external Google Fonts dependency, which degrades gracefully via system-font fallbacks.
Judge 2 — picked Godmode (conf 0.82; Godmode 0.79 / Vanilla 0.68)
Both are self-contained single-file HTML/SVG animations that fully implement the brief: stepped boil-egg walkthrough, doneness presets, play controls, keyboard support, and a cross-section. B is the stronger artifact on the two weighted dimensions. On completeness, B covers a more accurate technique (cold-start, salt, lid-off-heat, ice bath, 8 steps with 4 doneness levels incl. jammy) and adds a live color-lerped cross-section with an overcooked green-ring for hard eggs (DONENESS table + applyDonenessVisuals, index.html:1124/1216), plus a speed control and prefers-reduced-motion handling (index.html:647). On UX, B's typographic editorial design, always-visible cross-section, countdown timer, and animated step list read as more polished in the screenshot, while A hides the cross-section/tip/step-list entirely on mobile (A index.html:265-267). A is solid and its cumulative-state rebuild is clever, but it carries a real defect: clearTimeouts() (A index.html:558) runs clearTimeout over a mixed array that also holds {clear:fn} interval-wrapper objects, so that path silently fails to cancel intervals (it's only saved because enterStep also calls the correct clearIntervalsTimeouts). B uses a cleaner declarative CSS state machine driven by data-attributes and a single rAF tick with delta timing, which is more maintainable. Neither ships any tests or assertions, so both score low on testing.
Judge 3 — picked Godmode (conf 0.86; Godmode 0.78 / Vanilla 0.68)
Both are single-file HTML/CSS/SVG instructional animations with doneness presets, step navigation, a play loop, and keyboard controls, and both render cleanly. B is the stronger hand-off: it implements 8 steps with real cooking nuance (salt, lid-on/off-heat, ice bath, peel) versus A's 6, offers 4 doneness levels with a live cross-section that tweens yolk colour, a runny overlay, and an overcooked green ring (lerpColor/easeOutCubic in the script), plus a countdown timer, 1x/2x/3x speed control, prefers-reduced-motion handling and ARIA roles. B's architecture is also more maintainable: it drives all scene animation declaratively through data-* attributes on the SVG (applyStepFlags), whereas A juggles imperative setTimeout/setInterval handles with two different cleanup paths (clearTimeouts at line 558 only clears plain timeouts and silently no-ops on the interval-objects it is handed in resetStage). A is solid and visually clean, but B simply covers more of the brief with better polish.

code-editor

Panel verdict: Godmode (unanimous) | Δ +0.27

Brief: Create a browser-based code editor with syntax highlighting, multiple tabs, line numbers, and theme switching.

Blind assignment this run: A = godmode, B = vanilla. Compare both yourself: /showcase/code-editor.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.91	0.42	0.85	0.87	0.93	0.91	0.81
Vanilla	0.68	0.30	0.67	0.57	0.69	0.36	0.54

Per-judge rationale

Judge 1 — picked Godmode (conf 0.90; Godmode 0.80 / Vanilla 0.56)
A (Forge) ships a clean, working editor built on CodeMirror 5 with 27 languages, 20 light/dark themes, Doc-swapping multi-tab model, localStorage persistence (with size caps + dirty/beforeunload guards), file open/save/drag-drop, find/replace/jump-to-line and format; its screenshot renders correctly. B (CodePad) is a more ambitious from-scratch regex highlighter with a minimap and find bar, but its rendered artifact is broken: index.css lines 443-469 contain a stray duplicated `body { display: grid; grid-template-columns: 240px 1fr }` block (accidentally extracted from the DEMO_HTML string by scripts/h1-extract.mjs) that overrides the real `body { display: flex }`, jamming the editor into a 240px column with a huge empty area — exactly what B.png shows. B also never persists tabs across reload (only theme) and has a wrong scroll-sync target at index.html line 549. The layout defect is disqualifying for the core editing surface, so A wins on both completeness and ux.
Judge 2 — picked Godmode (conf 0.93; Godmode 0.85 / Vanilla 0.54)
A (Forge) renders as a complete, polished editor: CodeMirror-backed syntax highlighting, working multi-tab model with per-tab Docs, line numbers, 20 themes, plus localStorage persistence, file-size guards (2MB/5MB), dirty-close confirms, drag-drop, and accessibility roles — all visibly working in A.png. B (CodePad) has a disqualifying defect: B/index.css line 453 ships a stray leaked `body { display: grid; grid-template-columns: 240px 1fr }` (extracted from the DEMO_HTML content by h1-extract.mjs) that overrides the intended flex layout, which is exactly why B.png shows the editor crammed into a ~240px column with a huge dead empty pane. B's underlying editor logic (custom regex highlighter, tabs, themes, minimap, find) is competent in code, but the shipped artifact is broken on load, so A is the one I would hand back. Weighting completeness and ux as instructed, A wins decisively.
Judge 3 — picked Godmode (conf 0.90; Godmode 0.79 / Vanilla 0.53)
A (Forge) ships a complete, polished CodeMirror-5 editor: the screenshot renders cleanly with working tabs, line numbers, markdown highlighting, a full toolbar (theme/language/font/wrap), and a populated status bar, backed by genuine defensive code in app.js (localStorage persistence with try/catch and size caps, FileReader onerror, dirty-close confirms, beforeunload guard, DOM built via createElement/textContent). B (CodePad) is an ambitious from-scratch regex highlighter, but its screenshot shows a disqualifying layout defect — editor content and the highlight layer are clipped into a narrow left strip while two-thirds of the editor area sits blank, and its index.css ends with a leaked stray demo block (lines 443-468) the extraction script never cleaned up. B also drops content persistence across reload and its Ctrl+S only clears the modified flag rather than saving. Weighting completeness and ux, A is the one to hand to the brief author.

falling-sand

Panel verdict: Godmode (unanimous) | Δ +0.12

Brief: Make a falling sand simulation with water, fire, sand, wood, and oil that interact realistically.

Blind assignment this run: A = godmode, B = vanilla. Compare both yourself: /showcase/falling-sand.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.89	0.35	0.85	0.83	0.91	0.91	0.79
Vanilla	0.73	0.31	0.83	0.69	0.82	0.67	0.67

Per-judge rationale

Judge 1 — picked Godmode (conf 0.90; Godmode 0.77 / Vanilla 0.63)
Both implement the full brief (sand/water/fire/wood/oil with density layering, fire igniting wood+oil, water extinguishing to steam) plus smoke/steam byproducts, and both run. A is the stronger artifact: strict-mode IIFE with no global leakage, a richer interaction model (steam condensing back to water, fire color/heat gradient by lifetime, brush-cursor preview ring), full keyboard shortcuts (space/C/1-6/[ ]), an FPS readout, and a labeled three-group control panel that reads as finished in A.png. B ships a real defect: index.html links a `/inline-styles.css` that is absent from the directory, so the material color dots and hint styling never render (visible in B.png as dotless toolbar buttons), and its OIL color [107,74,30] is nearly identical to WOOD [107,67,33], making oil indistinguishable from wood on screen. B also leaks all state to global scope and has unreachable fire-life dead code in paint(). B's one edge is more realistic multi-cell liquid leveling and a bonus Stone material, but that doesn't offset the missing-asset and color-collision defects. Neither side ships any tests.
Judge 2 — picked Godmode (conf 0.83; Godmode 0.80 / Vanilla 0.69)
Both are single-file canvas falling-sand sims covering all five required materials with realistic interactions (oil floats on water via density, fire ignites wood slowly + oil fast, water extinguishes fire into steam, smoke byproducts). A (index.html) is the stronger ship: it wraps everything in an IIFE with 'use strict', uses a per-frame `moved` dirty-array to prevent reprocessing artifacts on all four move directions, adds a deeper physics cycle (steam re-condensing to water, life-based fire color), and the screenshot shows a polished titled side panel with swatches, key hints, FPS, and a brush-cursor ring plus a visible oil-on-water layer. B (index.html) is solid but barer: it leaks ~20 globals, has dead code (an unreachable `if (currentMat === FIRE)` branch in paint and an unused `up` var), no dirty-flag guard, and its seeded water pool drained off the platform to a thin floor layer in the screenshot, with a stone brazier block left looking orphaned. Neither ships tests. B does add a bonus Stone material and a persistent brazier emitter, but A's overall completeness and visual polish edge it out.
Judge 3 — picked Godmode (conf 0.86; Godmode 0.80 / Vanilla 0.70)
Both are single-file canvas sims with offscreen pixel buffers and all five brief materials interacting realistically (oil floats on water via density, fire spreads to wood/oil, water extinguishes fire into steam, sand displaces liquids). A edges ahead on correctness and polish: it carries a `moved` Uint8Array (index.html:57) that prevents the classic double-processing-per-frame bug B never guards against, gives oil a distinct purple (index.html:66) vs B where oil [107,74,30] and wood [107,67,33] are nearly identical browns (B/index.html:46-48), and ships a brush-cursor preview, FPS readout, labeled material swatches and full keyboard controls. The rendered proof is decisive: A.png shows the seeded scene actually demonstrating oil-on-water plus burning wood plus falling sand, while B.png shows only static structures, a thin water line, and no visible fire/water interaction; B also references a likely-404 /inline-styles.css (B/index.html:5) and has dead code (FIRE branch inside paint's else block).
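The double-processing bug that Judges 2 and 3 mention is easy to reproduce in isolation. The sketch below is not A's code (which is JavaScript and is not reproduced on this page); it is a minimal Python illustration, under our own naming, of why a per-frame `moved` flag matters when a grid is scanned top to bottom.

```python
EMPTY, SAND = 0, 1


def step(grid, use_moved_flag=True):
    """Advance one frame in place, scanning rows from top to bottom."""
    height, width = len(grid), len(grid[0])
    moved = [[False] * width for _ in range(height)]  # rebuilt every frame
    for y in range(height - 1):
        for x in range(width):
            if grid[y][x] != SAND:
                continue
            if use_moved_flag and moved[y][x]:
                continue  # this grain already moved during this frame
            if grid[y + 1][x] == EMPTY:
                grid[y + 1][x], grid[y][x] = SAND, EMPTY
                moved[y + 1][x] = True
    return grid


def sand_row(grid):
    """Row index of the single grain in column 0."""
    return next(y for y, row in enumerate(grid) if row[0] == SAND)


def fresh():
    """A 5-row, 1-column world with one grain at the top."""
    return [[SAND]] + [[EMPTY] for _ in range(4)]


print(sand_row(step(fresh(), use_moved_flag=True)))   # 1: one cell per frame
print(sand_row(step(fresh(), use_moved_flag=False)))  # 4: the grain drops to the floor in one frame
```

Scanning bottom to top would also fix the purely downward case, but a flag array covers sideways and upward moves (steam, smoke) as well, which is presumably why it is applied to all move directions.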

finance-dashboard

Panel verdict: Godmode (unanimous) | Δ +0.07

Brief: Build a personal finance dashboard that imports CSV bank statements, categorizes transactions, and shows charts and trends.

Blind assignment this run: A = vanilla, B = godmode. Compare both yourself: /showcase/finance-dashboard.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.90	0.33	0.82	0.85	0.92	0.90	0.79
Vanilla	0.84	0.28	0.81	0.76	0.79	0.82	0.72

Per-judge rationale

Judge 1 — picked Godmode (conf 0.82; Godmode 0.81 / Vanilla 0.74)
Both are clean, no-build localStorage dashboards (PapaParse + Chart.js) that escape DOM injection, dedupe imports, handle debit/credit and DD/MM-vs-MM/DD dates, and auto-seed sample data. B (B/js/*.js) is materially more complete and better architected: a pub/sub store, 6 tabs, 5 charts (adds top-merchants and category-trend), pagination, multi-account filtering, budgets with progress bars, light/dark theme, JSON backup+restore (store.js exportJSON/importJSON), regex rules with negative-lookahead AU categorization (categorize.js), savings-rate KPI, and date-range presets, plus stronger defensive guards (store.load/save try/catch, save-failed toast). A's edge is a genuine column-mapping modal fallback (A/app.js openMapModal) that B lacks, but B's richer feature set and resilience win on the weighted completeness+ux criteria. Neither ships test files, so testing scores reflect only in-code verification.
Judge 2 — picked Godmode (conf 0.80; Godmode 0.78 / Vanilla 0.70)
Both are shippable, well-built single-page apps using PapaParse + Chart.js, with consistent escapeHtml usage on dynamic HTML and localStorage persistence. B is materially more complete: modular architecture (store.js pub/sub, csv.js, categorize.js with regex rules + match counts, utils.js), 6 tabs including Budgets, multi-account, light/dark theme, JSON backup/restore, pagination, 5 charts, deterministic FNV-hash IDs for clean dedupe, and more robust parsing (DR/CR + parentheses negatives in utils.parseAmount, headerless positional fallback in csv.js, "12 Mar 2025" date form). A is leaner and very readable, and ships a genuinely useful manual column-mapping modal (openMapModal) that B lacks, plus deliberate Transfer-category exclusion from income/expense math (renderStats) which B omits. Weighting completeness and ux highest, B wins; neither ships any test files, which caps both on testing.
Judge 3 — picked Godmode (conf 0.86; Godmode 0.78 / Vanilla 0.72)
Both are polished, working dark-theme dashboards with CSV import (PapaParse), keyword/rule categorization, dedup, multiple Chart.js visualizations, and consistent escapeHtml use. B (js/ modules: store/utils/csv/categorize/charts/ui/app) is materially more complete and better architected: a pub/sub store, 6 tabs (Dashboard/Transactions/Import drag-drop/Rules with live regex match counts/Budgets with progress bars/Settings with theme + JSON export-import), multi-account support, savings-rate KPI, pagination, and richer parsing (headerless positional fallback, DR/CR + parentheses amounts, "12 Mar 2025" dates, deterministic FNV-hash IDs). A is tighter and easier to audit and adds a genuinely useful manual column-mapping modal that B lacks, but covers less of the implied table-stakes surface (no budgets/settings/multi-account). Since the brief weights completeness and ux highest and B leads both without a disqualifying defect, B is the one I would hand back.

markdown-notes

Panel verdict: Godmode (unanimous) | Δ +0.11

Brief: Create a markdown note-taking app with live preview, folder organization, search, and local storage persistence.

Blind assignment this run: A = vanilla, B = godmode. Compare both yourself: /showcase/markdown-notes.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.87	0.27	0.88	0.89	0.95	0.89	0.79
Vanilla	0.84	0.20	0.83	0.61	0.77	0.80	0.68

Per-judge rationale

Judge 1 — picked Godmode (conf 0.93; Godmode 0.83 / Vanilla 0.70)
Both are safe (marked + DOMPurify), persist to localStorage, and fully cover the brief's four pillars, but B is a markedly more complete and resilient product. B adds drag-and-drop folder organization, pin, themes, export/import JSON + single .md, rotating backups with restore, v1->v2 migration, sidebar resize/collapse, inline rename, word/read-time meta, scroll sync, Tab indent/dedent, and a custom modal/toast layer replacing native prompt()/confirm() (index.html lines 243-288, 685-813), plus real failure handling (QuotaExceededError toast at line 184, createNote save-rollback at 562, isDescendant cycle guard at 303). A (index.html, 345 lines) is clean and correct but plainer: native prompts, no export/import, no error recovery beyond a load try/catch. B's only defects are cosmetic and non-functional: a dead /inline-styles.css 404 reference (line 5) and an undefined is-7a524f11 class on the empty-state keyboard hint (line 58) that loses styling only on the no-note-selected screen.
Judge 2 — picked Godmode (conf 0.90; Godmode 0.75 / Vanilla 0.64)
Both are clean single-file vanilla-JS markdown apps using marked + DOMPurify and both fully satisfy the brief (live preview, folder tree, full-text search, localStorage). B is substantially more complete and resilient: it adds export/import JSON + single-.md export, 3-slot rolling backups with restore, legacy-state migration, drag-and-drop reorg with cycle prevention (isDescendant), pin, inline rename, themes, scroll sync, Tab indent, word/read-time meta, QuotaExceededError handling and beforeunload flush — all the implied table-stakes plus polish. A is solid and focused but plainer (prompt()-based rename/folder, no export/import, no backups). B's only real defects are leftovers from its CSS-extraction step: a dead `<link href="/inline-styles.css">` (harmless 404) and an orphaned `.is-7a524f11` class in the initial empty-state markup with no matching CSS rule, plus a clunky type-a-keyword "menu" modal. Weighting completeness and UX above the rest, B wins clearly; the defects are cosmetic, not disqualifying.
Judge 3 — picked Godmode (conf 0.88; Godmode 0.80 / Vanilla 0.69)
Both are single-file vanilla-JS apps that fully satisfy the brief with DOMPurify-sanitized live preview, nested folders, content+name search, and localStorage. B is a strict superset: it adds drag-and-drop reorg with cycle guards (isDescendant), inline rename, pinning, theme toggle, export/import JSON + single .md, backup rotation, legacy v1 migration, and notably stronger resilience (saveState catches QuotaExceededError, createNote rolls back on save failure, clampName strips control chars and caps length). A is cleaner and more focused but uses blocking prompt/confirm and has an unguarded saveState. B's only defects are harmless extraction artifacts (a dead /inline-styles.css link at line 5 and an unstyled is-7a524f11 class at line 58, both on the initial empty state that is overwritten when a note auto-opens). Weighting completeness and ux, B is the one I would hand back.

particle-sandbox

Panel verdict: Godmode (unanimous) | Δ +0.14

Brief: Build an advanced particle physics sandbox that's impossible to put down.

Blind assignment this run: A = godmode, B = vanilla. Compare both yourself: /showcase/particle-sandbox.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.90	0.35	0.85	0.82	0.90	0.91	0.79
Vanilla	0.69	0.27	0.79	0.64	0.83	0.65	0.65

Per-judge rationale

Judge 1 — picked Godmode (conf 0.85; Godmode 0.79 / Vanilla 0.61)
A is a cleanly modularized (8 ES modules) physics sandbox with real Coulomb/strong-force/gravity integration (js/physics.js), a spatial hash grid, 15 persisted discoveries, full procedural Web Audio, slingshot/scroll/touch input, and a polished HUD — its screenshot shows a working, colored UI with live orbiting particles and a discovery toast firing, plus it seeds starter atoms for instant engagement (js/app.js init). B is a capable single-file falling-sand cellular automaton with 19 well-interacting elements (good Uint8/Uint32 buffer approach), but it ships as one 1100-line inline script using window globals (window.cellData/processed), loses all grid contents on window resize (resize() copy is an empty stub, index.html ~97-101), and its screenshot is a blank black canvas with uncolored white buttons because every element-color class (is-* in index.html) lives only in the missing external inline-styles.css. Neither ships tests; A's only correctness wart is the XOR cell-hash collision risk in spatial.js. Weighting completeness and ux, A is the artifact I'd hand back.
Judge 2 — picked Godmode (conf 0.80; Godmode 0.80 / Vanilla 0.67)
Both are real, working sandboxes built on different interpretations: A is a fundamental-particle physics sim (Coulomb/strong-force/gravity n-body with a spatial hash grid in spatial.js), B is a falling-sand cellular automaton (18 materials, temperature model, typed-array grid). A ships more complete and more polished: it has procedural Web Audio (audio.js), a 15-achievement discovery layer with localStorage persistence (discoveries.js), full HUD/help/settings panels, and the screenshot (A.png) shows it rendering correctly with a discovery toast firing and orbiting particles. B's simulation is deeper and the single-file engine is genuinely impressive, but the shipped artifact has a visible UX regression: the per-element button colors lived in the missing /inline-styles.css, so B.png shows a row of colorless white buttons instead of the intended color-coded palette, and B has no audio/goals/persistence to drive the "can't put it down" hook. Both reference the same stripped /inline-styles.css, but A set its particle/UI colors inline via JS so it degrades cleanly while B does not. Neither ships tests, so testing is scored on in-code defensive guards only (A's particle/effect caps and audio-context guards edge out B's).
Judge 3 — picked Godmode (conf 0.72; Godmode 0.77 / Vanilla 0.65)
Both are complete, working, genuinely fun sandboxes from the same brief but take different forms: A (js/) is a modular force-based particle physics engine (Coulomb + strong-nuclear + gravity, velocity-Verlet, spatial-hash neighbor queries in spatial.js, annihilation/fusion/decay reactions in physics.js) with procedural Web Audio (audio.js), 15 localStorage-persisted discoveries (discoveries.js + config.js), and the screenshot shows it alive with particles, effects and a firing 'Speed Demon' toast. B (index.html) is a single-file Uint8Array falling-sand cellular automaton with 19 materials and rich emergent interactions (fire/lava/acid/gunpowder/clone/void/fuse, explosions, heat transfer, plant growth) rendered fast via Uint32 putImageData. A wins on the dimensions weighted highest: cleaner architecture, an addictive progression/audio loop ('impossible to put down'), and a far more compelling shipped render. B is held back by a real defect (resize() admits its old-grid copy is a no-op stub, so resizing wipes the whole world), pervasive window.* globals instead of encapsulated state, a GAS-explosion inner-loop `continue` that doesn't break out cleanly, and a screenshot showing a blank canvas with colorless toolbar buttons because the material colors live in an external is-* stylesheet that isn't shipped in B/. Neither ships any test files.

pixel-art-editor

Panel verdict: Godmode (unanimous) | Δ +0.19

Brief: Create a pixel art editor with layers, custom color palettes, animation frames, and PNG export.

Blind assignment this run: A = vanilla, B = godmode. Compare both yourself: /showcase/pixel-art-editor.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.92	0.86	0.82	0.83	0.92	0.89	0.87
Vanilla	0.83	0.17	0.78	0.66	0.83	0.84	0.68

Per-judge rationale

Judge 1 — picked Godmode (conf 0.83; Godmode 0.89 / Vanilla 0.68)
Both ship complete, working pixel editors covering the brief (layers, palettes, frames, PNG export). B wins on the dimensions that matter most here: it ships a real test harness (B/js/tests.js runs 12+ assertions on color conversion, compositing, flood-fill edge cases, and serialize round-trip including the >32KB chunked-base64 path) where A ships none; it adds true per-layer alpha compositing with an opacity slider (state.js compositeFrame), Save/Load project JSON with defensive deserialize and try/catch+toast error paths (export.js, app.js), and a clean 13-module ES architecture vs A's single inline script. The B screenshot proves the full pipeline renders an actual sprite end-to-end, while A's renders an empty canvas. A's edge is breadth of drawing tools (it has dedicated circle and move tools plus layer merge-down that B lacks) and tight packed-Uint32Array buffers, but A also has a dead stylesheet link (index.html line 5 references a missing /inline-styles.css) and no validation harness.
Judge 2 — picked Godmode (conf 0.82; Godmode 0.87 / Vanilla 0.70)
Both are complete, working pixel editors that nail the brief's four pillars (layers, custom palettes, animation frames, PNG export), but B is the stronger artifact on the weighted dimensions. B (js/export.js, js/state.js) ships genuine alpha compositing with per-layer opacity, project save/load via chunked base64 (state.js bytesToBase64 explicitly avoids the fromCharCode stack-overflow footgun), an export modal with scale + 3 modes + transparent toggle, a live preview panel, resize/new modals, toasts and a coordinate readout — and is the only side with a real test harness (js/tests.js, 15 assertions covering compositing/opacity, flood-fill edge cases, and a 1MB serialize stress test). A (index.html, single file) is also polished and actually has a richer raw toolset (line/rect/circle/move/picker), packed-Uint32 buffers, and strong mobile CSS, but it lacks save/load, per-layer opacity, and any tests, uses last-opaque-wins compositing, and references a non-existent /inline-styles.css. The screenshot confirms B renders a finished mushroom sprite with two frames and live preview, while A shows an empty canvas; weighting completeness and ux, B is the one I'd hand back.
Judge 3 — picked Godmode (conf 0.82; Godmode 0.86 / Vanilla 0.67)
Both are genuinely complete pixel editors covering layers, custom palettes, animation frames, and PNG export, but B is the stronger artifact. B ships a real self-test harness (js/tests.js, 13 assertions over color/composite/floodFill/serialize incl. a 1MB chunked-base64 case) run on boot, plus implied table-stakes A lacks: project save/load (js/export.js + state.js serialize/deserialize), per-layer opacity with true alpha compositing (state.js compositeFrame), an export modal with scale + sprite-sheet/each-frame modes, a live preview panel, fit-to-view, and a clean ES-module structure with try/catch around export and load. A is a polished single-file build with extra tools (circle, move, merge-layer) and good code, but ships zero tests, only binary layer visibility (no opacity blend), no save/load, and thinner failure handling. Weighting completeness and ux, B is the one I would hand back.

pomodoro-timer

Panel verdict: Godmode (unanimous) | Δ +0.19

Brief: Build a Pomodoro timer with customizable intervals, session history, daily stats, and notification sounds.

Blind assignment this run: A = godmode, B = vanilla. Compare both yourself: /showcase/pomodoro-timer.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.93	0.48	0.89	0.89	0.95	0.90	0.84
Vanilla	0.78	0.27	0.69	0.61	0.71	0.83	0.65

Per-judge rationale

Judge 1 — picked Godmode (conf 0.88; Godmode 0.85 / Vanilla 0.66)
A separates a pure, CommonJS-exported logic module (timer-logic.js) from DOM glue (app.js), implements the full brief including a 7-day daily-stats bar chart with tooltips, volume + a sound-test button, desktop-notification permission flow, and ships real resilience: localStorage try/catch with QuotaExceeded history-trimming (app.js saveJSON), history-record validation on load, settings sanitization/clamping, idle-resume on reload, day-rollover detection, and a double-fire completion guard. B is a clean, well-rendered single file but has a genuinely wrong streak (renderStats counts consecutive focus entries from the top of history with no day awareness, so a break resets it and it never reflects calendar days), omits the per-day/daily-stats chart that "daily stats" implies (only a today's summary), uses innerHTML in renderHistory (a DOM footgun even if current inputs are low-risk), and does not persist the session counter across reload. Both ship zero test files, so testing is scored on in-code defensiveness, where A is far stronger; A's only notable flaw is a comment in closeSettings claiming it preserves progress ratio on duration change while it just snaps totalMs.
Judge 2 — picked Godmode (conf 0.88; Godmode 0.85 / Vanilla 0.67)
A ships a 3-module architecture with a pure, side-effect-free timer-logic.js (dual-exported for Node testing), drift-proof epoch-delta timing, localStorage quota handling with history trim+retry, settings sanitization/clamping, double-fire completion guards, and superset features (7-day chart with tooltips, lifetime totals, cycle dots, volume + Test-sound, day-rollover detection) all rendered via safe createElement/textContent. B is a clean, attractive single-file build that covers the core brief but has a genuinely broken metric: renderStats counts consecutive focus history entries (which a single break resets) and labels it "Current Streak" rather than counting distinct days, plus it does no load-time validation of stored settings, so a corrupt focus value yields total=0 and a NaN progress bar / non-counting timer. A wins completeness decisively and is at least equal on UX with no disqualifying defect, so it's what I'd hand back.
Judge 3 — picked Godmode (conf 0.86; Godmode 0.82 / Vanilla 0.61)
A ships a markedly more complete and robust implementation: a pure, dual-exported timer-logic.js with NaN-guarded sanitizeSettings/clampNumber, wall-clock elapsed tracking plus a setTimeout completion fallback for throttled background tabs, all DOM built via createElement/textContent, localStorage wrapped in try/catch with a QuotaExceeded trim-and-retry, plus a 7-day bar chart, total stats, volume control, test-sound, auto-start toggles, and a correct date-based streak (aggregateStats in timer-logic.js:188). B (a single ~190-line inline script) is clean and renders well but is thinner on the brief and has a real defect: its streak counts consecutive head-of-history focus records regardless of date (index.html:218-220), so the daily-stat is wrong, and it uses innerHTML for history rendering. Neither ships an actual test file, but A's pure logic module is explicitly built for one. Weighting completeness and ux, A is the one I'd hand back.
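
All three judges flag the same streak defect, so it is worth spelling out what a date-based streak looks like. The sketch below counts consecutive local calendar days, ending today, that contain at least one focus session. It is an illustration only, not aggregateStats from timer-logic.js; the record shape ({ type, endedAt }) and the function names are assumptions.

```javascript
// Hypothetical record shape: { type: "focus" | "break", endedAt: epoch ms }.
// Illustrative only; not the aggregateStats implementation the judges cite.
function dayKey(ms) {
  const d = new Date(ms);
  return `${d.getFullYear()}-${d.getMonth() + 1}-${d.getDate()}`;
}

function dailyStreak(history, now = Date.now()) {
  // Collapse focus sessions to the set of local calendar days they fall on.
  const focusDays = new Set(
    history.filter((r) => r.type === "focus").map((r) => dayKey(r.endedAt))
  );
  let streak = 0;
  const cursor = new Date(now);
  while (focusDays.has(dayKey(cursor.getTime()))) {
    streak += 1;
    cursor.setDate(cursor.getDate() - 1); // step back one local calendar day
  }
  return streak;
}
```

Because breaks are filtered out before days are compared, a break record at the head of history no longer resets the count. One design choice to note: this version returns 0 until the first focus session of the current day is logged; a more forgiving variant could start counting from yesterday.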

ray-tracer

Panel verdict: Godmode (unanimous) | Δ +0.03

Brief: Build a real-time ray tracer in the browser that renders spheres with reflections, shadows, and adjustable lighting.

Blind assignment this run: A = vanilla, B = godmode. Compare both yourself: /showcase/ray-tracer.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.85	0.27	0.84	0.72	0.91	0.89	0.74
Vanilla	0.82	0.27	0.85	0.70	0.81	0.83	0.71

Per-judge rationale

Judge 1 — picked Godmode (conf 0.88; Godmode 0.77 / Vanilla 0.74)
A and B share an identical baseline (same CSS, HTML scaffold, JS render loop, sphere/plane intersection, Reinhard tone map), so this is decided by B's incremental feature work. B adds a "Light color" control wired to a uLightColor uniform (index.html:19, shade() at index.html:157-160), a visible emissive light source via lightHit() that appears in-scene and in reflections (B index.html:128-137, 187-196; the bright dot is visible in B.png), Fresnel-Schlick reflectance for physically-grounded glancing reflections (B index.html:204-209) versus A's flat h.refl multiply, plus more bounces (8 vs 5) and light-radius-aware shadows. B also cleaned up A's dead variables (the unused `float prev`/`prevT`/`idx` left in A's trace/shadow loops). Both ship zero tests and identical defensive guards (WebGL2 fallback message, shader compile/link checks), so testing/security/errors are a wash; completeness and ux, the weighted dimensions, go to B.
Judge 2 — picked Godmode (conf 0.90; Godmode 0.76 / Vanilla 0.73)
B is a strict superset of A: index.css is byte-identical and the camera/UI/render-loop JS is the same, but B's fragment shader adds three meaningful upgrades that A lacks: an adjustable light-color picker (uLightColor + <input type=color>, directly serving the brief's "adjustable lighting"), an emissive light source that is visible to the camera and in reflections (the bright glow on the red sphere in B.png that is absent in A.png), and Fresnel-Schlick reflectance for physically convincing glancing-angle reflections vs A's flat constant reflectivity. B also raises the bounce ceiling (max 8 vs 5) and makes the shadow ray light-radius-aware (dist - LIGHT_RADIUS). Both implementations fully satisfy the core brief (spheres, reflections, shadows, adjustable lighting), render correctly, guard WebGL2 absence and shader compile/link, and ship zero tests, so completeness and ux are the deciding axes and both favor B.
Judge 3 — picked Godmode (conf 0.86; Godmode 0.70 / Vanilla 0.67)
B is a strict superset of A: same solid WebGL2 fragment-shader ray tracer base, but B adds an adjustable light color picker (uLightColor wired through shade()), an emissive visible light source rendered directly and in reflections (lightHit(), visible as the warm dot in B.png), and physically-based Fresnel-Schlick reflectance (index.html:204-209) versus A's flat reflectivity. B also tightens shadow correctness with dist - LIGHT_RADIUS (index.html:152) and removes the dead prev/prevT/idx locals that A still carries in trace()/shadow() (A/index.html:95,99,122). Both render correctly and meet the brief (spheres, reflections, shadows, adjustable lighting), so the gap is incremental rather than disqualifying, but B is the one I'd hand back: it more fully satisfies "adjustable lighting" and looks richer.
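
The Fresnel-Schlick term the judges credit is a standard approximation: reflectance rises from a base value f0 at head-on incidence toward 1 at grazing angles. The sketch below writes it in JavaScript for readability; the shader in question is GLSL, and the function name, the clamping, and the choice of f0 are assumptions rather than the code at index.html:204-209.

```javascript
// Schlick's approximation: F = f0 + (1 - f0) * (1 - cosTheta)^5
// cosTheta is the cosine of the angle between the view ray and the surface normal.
// f0 is the reflectance at normal incidence (assumed here to come from the material).
function fresnelSchlick(cosTheta, f0) {
  const c = Math.min(Math.max(cosTheta, 0), 1); // clamp to [0, 1]
  return f0 + (1 - f0) * Math.pow(1 - c, 5);
}
```

With a flat reflectivity multiply, a sphere reflects the same fraction of light at every angle. With this term the reflection strengthens toward the silhouette, which is the "glancing-angle" effect the rationales describe.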

roguelike-dungeon

Panel verdict: Godmode (unanimous) | Δ +0.14

Brief: Build a roguelike dungeon crawler with procedural generation, turn-based combat, inventory, and permadeath.

Blind assignment this run: A = vanilla, B = godmode. Compare both yourself: /showcase/roguelike-dungeon.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.92	0.53	0.84	0.86	0.88	0.82	0.81
Vanilla	0.71	0.34	0.78	0.63	0.85	0.72	0.67

Per-judge rationale

Judge 1 — picked Godmode (conf 0.70; Godmode 0.83 / Vanilla 0.72)
Both ship fully working roguelikes hitting all four brief features (procgen rooms+corridors, shadowcasting FOV, bump combat, inventory, permadeath, Amulet win). A (single 1093-line game.js + canvas tiles) edges UX with floating damage text, on-map monster HP bars, and a clickable HTML inventory, but uses one big global state object, unseeded Math.random, and minimal guards. B (12 modular ES files) is markedly stronger on engineering: seeded deterministic mulberry32 RNG with getState/setState, frozen enums, Uint8Array maps, pure-logic/DOM separation, JSDoc throughout, and real defensive invariants that throw (rng.int/weighted, generateDungeon depth range, computeFov, makeMonster/makeItem) — the closest thing to shipped verification absent any test files. B also out-features A with hunger, true ascend/descend multi-level travel, drop, confusion, and a NetHack-style carry-amulet-to-surface win. B's one concrete bug: scroll kills add target.xp directly (items.js L123/L145), bypassing gainXp so they never trigger level-ups. Weighting completeness+ux highest, A's ux lead is small while B leads on completeness and dominates code/errors, so B is the artifact I'd hand back.
Judge 2 — picked Godmode (conf 0.82; Godmode 0.80 / Vanilla 0.68)
Both are genuinely complete, playable roguelikes covering all four brief pillars (procgen rooms+corridors, shadowcast FOV, bump-to-attack turn combat, inventory/equip, permadeath with restart). B (B/src/*) is the stronger artifact: 13 cleanly separated ES modules with a seeded mulberry32 RNG (rng.js — reproducible runs, seed shown in HUD), flat Uint8Array maps, single-source constants, heavy JSDoc, and defensive throws (RangeError/TypeError in rng/dungeon/fov/items), and its rendered screenshot is notably more polished (full sidebar with Depth/Turn/HP bar/Hunger/Amulet/Seed plus an always-visible controls hint). A (A/game.js) is a competent but globally-coupled 1094-line monolith using unseeded Math.random, has dead code in generateFloor (both branches of the explored-grid if/else are identical, lines 145-146), and its screenshot renders a sparse single-room view that reads as less finished, though it does add a DOM HUD/log and real mobile-responsive CSS that B lacks. Weighting completeness and UX highest, B edges completeness (hunger/starvation, four functional scrolls incl. monster confusion AI, ascend-to-win) and clearly wins code quality and error resilience; B's only real gaps are needing to be served over HTTP (ES modules won't load via file://, though the screenshot confirms it renders) and scroll kills bypassing gainXp's level-up handling (items.js applies player.xp += directly).
Judge 3 — picked Godmode (conf 0.82; Godmode 0.79 / Vanilla 0.61)
Both ship a working browser roguelike with procgen, shadowcasting FOV, bump combat, inventory, leveling, 10 floors and a permadeath win/lose loop, but B is the stronger artifact on the weighted dimensions. B's modular ES6 architecture (13 files: pure game.js with zero DOM, seeded mulberry32 RNG with getState/setState, frozen constants as single source of truth, JSDoc and RangeError/TypeError guards in rng.js/fov.js/dungeon.js/entities.js) is clearly more maintainable and defensive than A's single 1090-line global-mutating game.js, and B adds genuine table-stakes roguelike mechanics A lacks (hunger/starvation, ascend, carry-the-amulet-to-surface win, reproducible seeds shown in the HUD). The rendered evidence seals it: A.png is mostly empty black canvas with one tiny lit room and a wasted viewport, while B.png shows a fully realized ASCII dungeon plus a rich sidebar HUD (Depth/Turn/Level/XP/HP bar/equipment/hunger/amulet/seed) and a color-coded log, reading as a far more complete and polished result. A's redeeming edges are floating combat-text animations, monster HP bars, gold, and atk/def buffs, but neither side ships tests and A's static frame undersells it.
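
The seeded RNG is the engineering point all three judges single out, so a short sketch may help. mulberry32 is a small, widely used 32-bit generator; wrapping it with getState/setState makes a run replayable from any point. This is a sketch in the spirit of what the rationales describe, not the contents of rng.js: the createRng wrapper and the exact guard in int() are assumptions.

```javascript
// Minimal seeded RNG sketch built on mulberry32.
// The wrapper shape (createRng, int, getState, setState) is hypothetical.
function createRng(seed) {
  let state = seed >>> 0; // keep state as an unsigned 32-bit integer

  function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  }

  // Inclusive integer range with a defensive invariant that throws.
  function int(min, max) {
    if (!Number.isInteger(min) || !Number.isInteger(max) || max < min) {
      throw new RangeError(`int(${min}, ${max}): expected integers with min <= max`);
    }
    return min + Math.floor(next() * (max - min + 1));
  }

  return {
    next,
    int,
    getState: () => state,
    setState: (s) => { state = s >>> 0; },
  };
}
```

Two generators created with the same seed produce the same dungeon, which is what makes a seed shown in the HUD useful for bug reports. Unseeded Math.random offers no equivalent.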

synth-drum-machine

Panel verdict: Godmode (unanimous) | Δ +0.14

Brief: Build a web-based synthesizer and drum machine with a step sequencer, multiple waveforms, and effects.

Blind assignment this run: A = godmode, B = vanilla. Compare both yourself: /showcase/synth-drum-machine.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.91	0.29	0.85	0.84	0.94	0.92	0.79
Vanilla	0.79	0.22	0.81	0.57	0.75	0.74	0.65

Per-judge rationale

Judge 1 — picked Godmode (conf 0.90; Godmode 0.80 / Vanilla 0.65)
A ships a markedly more complete instrument: real polyphony with noteOn/noteOff, a 2-oscillator voice with mix/detune, both amp ADSR and a filter envelope (js/synth.js), stuck-note prevention and a panic() voice killer, 4 selectable presets, swing, a live oscilloscope, a proper on-screen piano with black/white keys, per-row audition buttons, and clamped/guarded audio setters plus an init try/catch with an error banner (js/ui.js, js/audio.js). The rendered screenshot (A.png) shows a polished UI with a preset already lit. B (single index.html) is solid and readable and uniquely adds a scale/root-note system, but its synth is monophonic fixed-duration one-shots with no note-off, it has no presets/swing/visualizer, its ADSR range callbacks are no-ops (v=>{} at lines 579-582), drums bypass the filter/distortion/delay chain (lines 256-257), and its screenshot (B.png) shows an empty grid with the synth-track labels overflowing past the keyboard area. Neither ships test files, so both score low on testing; A's defensive guards and clamping give it the resilience edge.
Judge 3 — picked Godmode (conf 0.86; Godmode 0.82 / Vanilla 0.67)
A is the stronger ship: a clean six-module architecture (audio/synth/drums/sequencer/presets/ui) with a true polyphonic voice manager (active-note Map, note stealing, noteOff, panic), dual oscillators with mix/detune, a filter envelope on top of ADSR, plus UX extras B lacks entirely (built-in presets, per-row audition buttons, scroll-to-pitch with note names, a live oscilloscope, and a real white/black piano keyboard with octave shift). B is genuinely good and has one conceptual edge (root-note + 5-scale degree system, bass track), but its synth is monophonic-per-trigger with no real note-off for held keys, its "keyboard" is 8 buttons, it has no presets/scope, and it is far less defensive (A clamps every param, guards every node, wraps init in try/catch with an error banner, and cleans up voices; B relies on terse `node && (...)` guards and ships no default pattern, so it renders empty). Neither side ships any test files, so testing scores low for both on defensive-verification grounds where A still leads.
Judge 2 — picked Godmode (conf 0.86; Godmode 0.76 / Vanilla 0.63)
A is the more complete and polished build: it ships per-step pitch editing on the synth row (js/ui.js onSynthWheel + note-name display), a preset library with BPM (js/presets.js: Boom Bap/Four-on-Floor/Breakbeat), swing, a fuller master FX chain (distortion → filter → delay-feedback → convolver reverb → compressor → analyser), a real on-screen piano with positioned black keys plus computer-keyboard mapping and octave shift, an oscilloscope, and an error banner with try/catch init (js/ui.js DOMContentLoaded). B is genuinely solid and musically clever (scale/root-note system mapping 4 synth tracks + bass to scale degrees, correct lookahead scheduler), and its single-file structure is clean, but it lacks per-step pitch control, presets, a visualizer, a true keyboard (8 buttons only), swing, and any error/resilience handling. Both correctly use the lookahead scheduling pattern and have no real injection surface (createElement/textContent, no user HTML), so the gap is completeness and UX, which the brief weights highest.

tetris

Panel verdict: Godmode (unanimous) | Δ +0.03

Brief: Create a polished Tetris clone with hold piece, ghost piece, next queue, scoring, levels, and a local leaderboard.

Blind assignment this run: A = godmode, B = vanilla. Compare both yourself: /showcase/tetris.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.84	0.27	0.85	0.81	0.93	0.89	0.77
Vanilla	0.86	0.27	0.82	0.82	0.84	0.82	0.74

Per-judge rationale

Judge 2 — picked Godmode (conf 0.78; Godmode 0.76 / Vanilla 0.72)
Both are complete, correct single-file Tetris clones with 7-bag, SRS kicks, ghost, hold, 5-piece next queue, level curve, lock delay, and a localStorage leaderboard with HTML-escaped names. A wins on the weighted axes (completeness + ux): A ships real touch/mobile support (index.html L53-680: on-screen buttons, pointer events, DAS auto-repeat, responsive canvas resizing) so it is playable on phones, while B has responsive CSS but zero touch controls (B/index.html), leaving it unplayable without a keyboard despite shrinking the board. B also ships a dangling `<link href="/inline-styles.css">` (B/index.html L5) to a file absent from the directory, plus extraction-artifact classes (is-f2fecb34, is-6e22c58a) with no backing rules. B's edges are nicer desktop polish (start screen, Tetris!/level toasts, persistent Best score) and a lock-reset cap (lockResets < 15) that prevents the infinite-spin stall A allows; not enough to overcome the missing asset and absent touch play.
Judge 3 — picked Godmode (conf 0.72; Godmode 0.78 / Vanilla 0.77)
Both are complete single-file canvas Tetris clones with 7-bag randomizer, SRS wall kicks, hold, ghost, 5-piece next queue, standard line scoring (100/300/500/800 x level), level curve, lock delay, DAS/ARR, and an XSS-safe localStorage leaderboard (both escape names). A wins on completeness/UX: it ships a full on-screen touch control panel with DAS-style repeat (real mobile playability, index.html lines 53-61, 637-680) plus an always-visible controls reference, and its screenshot confirms a polished active game with a working ghost piece; B's mobile layout is responsive but has no touch input, so a touchscreen user cannot actually play it, and B ships a dead `/inline-styles.css` link (index.html line 5, file absent). B has the cleaner, more idiomatic code (matrix rotation, lockResets<15 infinity guard at lines 230-236, toasts, separate best-score), and A's unconditional lockTimer reset on every move/rotate permits indefinite stalling. Net: B is the better-engineered core, but A is the more complete shippable product against the brief's implied mobile table stakes.
Judge 1 — picked Godmode (conf 0.70; Godmode 0.76 / Vanilla 0.73)
Both are complete, correct single-file Tetris clones with all six required features, full SRS kick tables, DAS/ARR, lock delay, HTML-escaped localStorage leaderboards, and clean dark-themed layouts. A wins on completeness and demonstrated polish: it ships working mobile touch controls (index.html lines 53-61, 637-680) and an always-visible controls panel, its screenshot shows a fully live game (active piece, ghost outline, 5-deep colored next queue), and it wraps both localStorage read AND write in try/catch (lines 560-566). B is slightly more elegant in code (rotation-matrix approach plus a capped 15-lock-reset that is more correct modern SRS behavior, lines 230-236) and adds Best-score/toasts/start-menu, but it has no mobile controls, leaves saveLB/setBest unguarded against quota/private-mode throws (lines 461,465), and its screenshot only shows the start menu rather than gameplay.
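
The lock-reset cap that judges credit to the vanilla build is simple to express. Each move or rotation of a grounded piece restarts the lock timer, but only a limited number of times, so a player cannot stall forever by spinning. The sketch below is a hypothetical reduction: the cap of 15 comes from the rationales, while the 500 ms delay, the state shape, and the function names are assumptions.

```javascript
const LOCK_DELAY_MS = 500;  // assumed value; the rationales do not state it
const MAX_LOCK_RESETS = 15; // the cap the judges cite (lockResets < 15)

function createLockState() {
  return { timer: 0, resets: 0 };
}

// Call after a successful move or rotation while the piece is grounded.
function onGroundedMove(lock) {
  if (lock.resets < MAX_LOCK_RESETS) {
    lock.timer = 0;
    lock.resets += 1;
  }
  // Past the cap, the timer keeps running and the piece will lock.
}

// Call every frame while the piece is grounded.
// Returns true once the piece should lock in place.
function tickLock(lock, dtMs) {
  lock.timer += dtMs;
  return lock.timer >= LOCK_DELAY_MS;
}
```

Removing the resets check yields the unconditional reset the judges describe in the other build: any input arriving inside the delay window postpones the lock indefinitely.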

tower-defense

Panel verdict: Godmode (unanimous) | Δ +0.09

Brief: Make a tower defense game with multiple tower types, upgrade paths, enemy waves, and a map editor.

Blind assignment this run: A = vanilla, B = godmode. Compare both yourself: /showcase/tower-defense.html

Mean of 3 judges	Code	Testing	Security	Errors	Complete	UX	Composite
Godmode ▸ winner	0.88	0.27	0.84	0.79	0.92	0.87	0.76
Vanilla	0.77	0.18	0.84	0.63	0.79	0.82	0.67

Per-judge rationale

Judge 2 — picked Godmode (conf 0.86; Godmode 0.81 / Vanilla 0.69)
Both are single-file canvas tower-defense games that hit every brief item (multiple tower types, upgrades, waves, map editor), but B is the more finished product. B uses class-based entities holding direct object references (Enemy/Tower/Projectile in B/index.html lines 406-720), giving it branching upgrade trees with two level-3 specializations per tower, 4 preset maps, localStorage save slots with serialize/deserialize (lines 1510-1528), a 30-wave win condition, victory/defeat modals, pause, and a map-select screen. A is solid and ships a live auto-rendered board, but carries a latent index-aliasing bug: projectiles store `targetId: enemies.indexOf(target)` (A line 388) while `updateEnemies` reassigns `enemies = enemies.filter(...)` every frame (lines 669-670), so in-flight projectiles can resolve against the wrong enemy after a kill — and A's upgrade paths are linear-only with no persistence. Neither ships tests; B at least exposes a `window.TD` debug hook.
Judge 1 — picked Godmode (conf 0.82; Godmode 0.75 / Vanilla 0.69)
Both are complete, single-file canvas tower-defense games that render correctly and cover every brief item (5 tower types, upgrades, waves, map editor). B is the stronger artifact: it ships branching upgrade paths (each tower's level 3 forks into two specializations like Ranger/Crossbow and Mortar/Demolisher in index.html ~L141-219), four preset maps plus a map-select screen, localStorage-backed 5-slot map persistence with try/catch (saveSlot/loadSlot ~L1518-1528), pause, finite 30-wave victory/defeat modals, and a more robust projectile model that holds direct enemy object references. A is cleaner to read and has elegant infinite-wave scaling, but its projectiles target enemies by mutable array index (createProjectile uses enemies.indexOf and updateProjectiles reads enemies[p.targetId] after the array is reassigned via filter, index drift bug ~L386/547), has no pause and no map persistence, and a dead double-filter at L669-670. B's only real defect is a dangling /inline-styles.css link (cosmetic .is-* spacer classes go unstyled; core UI renders fine per screenshot).
Judge 3 — picked Godmode (conf 0.84; Godmode 0.73 / Vanilla 0.64)
B is the more complete and more correct artifact: 5 towers with 3 levels PLUS two level-3 specialization branches (pierce/multishot/crit/freeze/railgun), 6 enemies, a finite 30-wave campaign with victory/defeat modals, 4 preset maps, and a map editor with 5 localStorage save/load slots and live BFS path validation (index.html lines 129-220, 491-637, 1510-1557). It uses object-reference targeting in Tower/Projectile, which is correct. A is solid and clean but carries a real combat bug: projectiles store enemies by array index (createProjectile `targetId: enemies.indexOf(target)`, read as `enemies[p.targetId]`), while updateEnemies filters the enemies array every frame, so indices go stale and projectiles/chain-lightning can hit the wrong enemy or whiff after any death (lines 384-400, 545-606, 669-670); A's editor also has no persistence and the game is endless with no win state. B's only notable defect is a missing `/inline-styles.css` referenced in its head, leaving a handful of `is-XXXX` spacer/button classes unstyled, but its primary index.css is complete and the screenshot confirms it renders and plays correctly. Neither ships tests.
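
The index-aliasing bug all three judges describe is easy to reproduce in isolation. The sketch below is a hypothetical minimal model, not either build's code: it stores a target once by array index and once by object reference, filters out a dead enemy as a per-frame cleanup would, and then resolves each target.

```javascript
// Hypothetical minimal model of the two targeting strategies.
function makeEnemies() {
  return [
    { name: "grunt", hp: 0 },   // already dead, about to be filtered out
    { name: "runner", hp: 10 }, // the projectile's intended target
    { name: "tank", hp: 50 },
  ];
}

// Strategy 1: remember the target's array index.
function resolveByIndex() {
  let enemies = makeEnemies();
  const projectile = { targetId: enemies.indexOf(enemies[1]) };
  enemies = enemies.filter((e) => e.hp > 0); // cleanup shifts every later index
  return enemies[projectile.targetId].name;
}

// Strategy 2: hold a direct reference to the target object.
function resolveByReference() {
  let enemies = makeEnemies();
  const projectile = { target: enemies[1] };
  enemies = enemies.filter((e) => e.hp > 0); // the reference is unaffected
  return projectile.target.name;
}

console.log(resolveByIndex());     // prints "tank" (wrong enemy)
console.log(resolveByReference()); // prints "runner" (intended enemy)
```

If the stale index points past the end of the filtered array, the lookup returns undefined instead, which corresponds to the "whiff" case one judge mentions. A reference-based projectile still needs a liveness check (for example, target.hp > 0) before applying damage.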

Method & honesty notes

Both implementations in each pair were built earlier from the same one-line brief, same model (Claude Opus 4.6), same environment — the only variable is the execution protocol (vanilla Claude Code vs the Godmode skill).
Judges were fresh Claude Opus 4.8 agents. This is a blind review (judge does not know which tool produced what), not a cross-vendor one: producer and judge are both Claude. We are not claiming an outside lab graded this; we are claiming the grader could not see the labels.
The grader is a capable LLM, not a human panel. Treat these as automated structured judgments, useful precisely because they are reproducible and label-blind — not as a substitute for trying the tools yourself.
A/B slot assignment and presentation order are deterministic from the seed, so anyone can reproduce the exact same blinding.
Nothing here is curated. Every pair is listed in the same detail. Vanilla won or tied none of the 14 pairs in this run, but any such pair would have been reported the same way.

Reproduce it

The full machinery is in the repo under scripts/independent-review/: blind-stage.js (strips labels, randomizes A/B, screenshots each staged copy, fails loudly if any brand token survives), judge-prompt.md (the verbatim prompt every judge received), and build-review.js (this aggregator). The raw machine-readable verdicts — every judge's per-dimension scores, winner pick, confidence, and rationale — are at /showcase/data/independent-review.json. Seed: gm-independent-review-2026-06-13.
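
To make the deterministic blinding concrete, here is a minimal Node sketch of hashing the seed and slug into a slot assignment. The hash input format, sha256(seed + ":" + slug), is as stated at the top of this page. How blind-stage.js maps the digest to slots is not stated here, so the parity rule below is an assumption and this sketch is not expected to reproduce the published assignments; consult blind-stage.js for the actual rule.

```javascript
const { createHash } = require("node:crypto");

const SEED = "gm-independent-review-2026-06-13";

// Returns which implementation lands in slot A and slot B for a given pair.
function assignSlots(slug, seed = SEED) {
  const digest = createHash("sha256").update(`${seed}:${slug}`).digest();
  // ASSUMPTION: an even first byte puts godmode in slot A.
  // The real mapping in blind-stage.js may differ.
  const godmodeFirst = digest[0] % 2 === 0;
  return godmodeFirst
    ? { A: "godmode", B: "vanilla" }
    : { A: "vanilla", B: "godmode" };
}

console.log(assignSlots("tetris"));
```

Because the only inputs are the published seed and the pair's slug, the same call returns the same assignment on any machine, which is the property that makes the blinding reproducible.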

// see it on your own work

The judges could not see the labels and still picked Godmode in 14 of 14.

Get Godmode → Try Godmode Lite free

← back to showcase