The gate that caught us
I called the page done. Then I spun the mouse wheel and nothing moved. The native scrollbar still dragged the page, so the content was there, but the wheel did nothing and two pinned sections were lying on top of each other like a bad print registration. I'd shipped that. I'd run a gate over it, watched it go green, and written the word "complete" in a commit message. The operator found the regression in about a minute of actually using the thing.
This is the honest version of that arc. Most "built with AI" writeups would turn it into a tidy story about an AI building a website. This one's messier and truer: an agent that kept confidently naming the wrong cause, a human who kept catching it, and a methodology that got mechanically harder to fool every time the agent was honest about being wrong. The agent's quoted by name throughout, straight from the build's own log, because the orchestration is the point. Sand the name off, write "my AI assistant," and you lose the only part that proves anything.
The win that wasn't
Before the bad refactor there was a good one. We captured a spec for a new hero treatment, ratified it, and shipped iteration one: a real per-channel refraction shader, still live on the site today. Then I ran Lighthouse. The performance score read 37 on an earlier pass and 73 after the shader work. A 35-point jump.
I wanted that to be a win. The agent wouldn't let me have it:
Do not over-read 37 to 73. A 35-point swing on a marginal code delta is implausible as a real gain. Treat this as a clean fresh baseline that locates the bottleneck, not a before/after delta.
Claude Code, iteration-1 measurement, build log
That instinct, distrust a number that flatters you, is the whole post in one paragraph. I didn't fully learn it here. I learned it three more times, expensively, over the next few days.
Four changes in one phase
The next phase was structural. One commit carried four independent architectural changes: deferring GSAP's scroll engine past first paint, introducing a single persistent WebGL canvas spanning two scenes, migrating a background blob from SVG to the 3D layer, and stripping opaque backgrounds off two sections so the canvas could show through.
Lighthouse looked great afterward. Desktop performance up ten points, total blocking time down 48 percent. I shipped it behind a two-leg gate: the server-rendered markup was intact, the performance numbers were measured. Both legs green. I wrote "complete."
Neither leg had checked whether you could actually scroll the page.
The minute it took to catch
The wheel was dead because of a deletion I'd made and described, in a previous session's notes, as removing a "duplicate." It was not a duplicate. It was the only thing reliably driving the scroll loop. The gate I'd run couldn't have caught this, because the gate didn't include a human, or anything like a human, using the page.
Phase 1 shipped a critical functional regression that the two-leg gate (server-rendered markup plus perf) missed, and I wrongly declared it complete. The operator caught it.
Claude Code, hotfix-1 entry, build log
The fix for the process was bigger than the fix for the bug. The gate became three legs, permanently: structure, then function, then performance, and the functional leg gates the performance leg. You don't get to report a pretty Lighthouse number until something has proven the page still works. That rule has a commit hash and a date. It exists because of this exact failure.
Wrong three times, the same way
Fixing it took three rounds, and the agent was confidently wrong in the same shape each time.
Round one surfaced the real lesson behind the dead wheel:
A note in the log is not ground truth. Any change that relies on a prior logged statement must first verify that statement against the code. The "duplicate" was never tested. It was load-bearing.
Claude Code, hotfix-1 lessons, build log
Round two re-diagnosed a leftover bug as "structural," a stacking and opacity problem from the background removal. The two-hotfix budget (a hard rule: you don't get a third consecutive remediation attempt before re-scoping) fired and stopped the chain. Good. Except the structural diagnosis was also wrong, and we only found that out by acting on it. We did a careful selective revert of exactly the structure the diagnosis blamed. The bug didn't move a pixel.
Reverting exactly that structure did not change the bleed at all. The operator ratified that revert against my diagnosis. I led them to a revert that does not fix the problem. This is the third diagnosis-as-ground-truth failure this arc.
Claude Code, selective-revert disproof, build log
Three times the agent named a cause with confidence. Three times the cause was an assertion wearing the costume of a finding. Each one survived exactly until something measured it.
The control experiment
The thing that should've happened on day one happened on day four. Instead of arguing about which change caused the bleed, we built the known-good baseline (the pristine commit from before any of this) and ran it through the same browser harness. Pristine: zero non-adjacent overlaps. The broken state: twelve. That's not an argument, that's a measurement, and it ended the debate in one run. The deferred-scroll change was the cause, established by experiment, not by anyone's conviction.
Then the same discipline took back the thing I'd been most proud of:
The desktop ten-point, 48-percent win was measurement-methodology noise, not a real gain. Reverting it cost no actual performance. The early unpinned Lighthouse runs manufactured a phantom win that drove three rounds of risky remediation.
Claude Code, full-revert retrospective, build log
Read that again. The number that justified the entire risky phase was never real. An unpinned, unstandardized measurement invented a win, and that win cost four days. We reverted everything to pristine and kept the shader. The fanciest work in the arc shipped nothing. The boring rule, pin the measurement, would've saved all of it.
What the failures built
I don't have abstract principles to offer you, just a list of rules, each with a specific failure chained to its ankle.
The three-leg gate exists because a two-leg gate shipped a page you couldn't scroll. The two-hotfix budget exists because without it I'd have run fix 1.6, 1.7, 1.8 forever. The rule that a phase contains exactly one architectural change exists because four changes in one commit meant that when it broke nothing could be isolated, and a control experiment was the only way out. The pinned-Lighthouse rule exists because an unpinned one hallucinated a win. And the rule that you measure the control before you ratify a revert exists because we kept not doing it:
Stop ratifying reverts against asserted causes. The control experiment should have been step one, not step four.
Claude Code, escalated lesson, build log
It didn't stop there. A later arc found the mobile performance score was bimodal: the same build, measured twice back to back, would return roughly 58 then 68 with no code change at all. So the gate moved off the score and onto total blocking time, which is stable, plus a median of several runs for anything score-shaped. Another arc found a subtler trap: a grep or a chunk-name match would report a clean signal that a real measurement then contradicted. So now any check that leans on string matching has to be paired with a semantic one. The surface signal proposes; the measurement disposes.
And the most recent one is the reason this post exists at all. While building the surface you're reading this on, the agent closed a milestone, didn't stop for the visual review it was supposed to stop for, and started building the next thing on top of unconfirmed work. I caught it by hand, again. The response wasn't "be more careful." The response was a pre-commit hook that writes a halt marker and refuses to let the next commit through until the human removes it. The one class of mistake that had been caught only by a person watching is now caught by the repository itself.
That's the pattern under all of it. Every honest near-miss got converted into a mechanism. The methodology can only get harder to fool now, never easier, because the cost of admitting a near-miss is one halt and one honest log line, and the cost of hiding one is that it recurs forever on top of an audit trail everyone now trusts and that's quietly false. Cheap to admit. Ruinous to bury. That asymmetry, kept visible on purpose, is the actual product.
If your AI coding workflow keeps catching near-misses too late, or doesn't catch them at all, I can help. Send the workflow you're using now, the kind of work it's producing, and where it tends to drift. VibeKoded can scope a spec discipline install, gate configuration, or operator handoff. → Work with VibeKoded