Why AI-generated code does not work

Your AI wrote code. The code does not work. You ran it, watched it, and either it didn't do what you asked, or it did the wrong thing, or it crashed, or it looked done but wasn't. You're not crazy. The failures have shapes.

I've shipped enough AI-built work, and pulled apart enough broken work that other people sent me, to recognize the same patterns showing up again and again. Five of them, mostly. Knowing which one you're staring at is half the rescue. The other half is what to do about it, and most of the time the right move is smaller than you think.

This is the diagnostic map. Each failure mode comes with the question that surfaces it, the reason it happens, and what to do once you know which one you have.

Failure mode 1: Missing context

The AI built from memory of how code should look, not from your project's actual files.

Diagnostic question: did the AI read the files it was supposed to change, or did it pattern-match from training data?

This is the most common shape, and it's almost always silent. The output looks fine. The syntax is correct. The logic, in isolation, is reasonable. But it doesn't actually fit your codebase. Imports point at things that don't exist. Function names assume the rest of your code calls them differently. The styling references a design system the AI was guessing at.

I caught one of these mid-build on the /log nav, where the AI proposed a fix for a hidden navigation element without reading how my smooth-scroll library and animation framework were interacting under the hood. The fix made the symptom go away. It also created a coordination bug between two libraries that took three rounds to find. The methodology for that arc is documented in the-nav-that-almost-wasnt. Short version: the AI was building from its idea of how scroll handlers usually work, not from how mine were actually structured.

The fix is process, not code. Make the AI read the file it's about to change. If it has tools for that, force them. If it doesn't, paste the file in and tell it to wait. Pattern-matching the right answer is faster than reading the actual code, and that's exactly the problem.
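
One symptom of missing context can be checked mechanically. The sketch below assumes the generated code is Python and lists every absolute import that does not resolve in the current environment. The function name `unresolved_imports` is made up for illustration, and the check only covers the "imports point at things that don't exist" case; it says nothing about mismatched function names or a guessed-at design system.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot be found."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            # Not an import, or a relative import (skipped in this sketch).
            continue
        for name in names:
            top_level = name.split(".")[0]
            # find_spec returns None when no module of that name is importable.
            if importlib.util.find_spec(top_level) is None and top_level not in missing:
                missing.append(top_level)
    return missing
```

Run it against the AI's output before accepting the change. A non-empty result is a strong hint the code was written from memory rather than from your project.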

Failure mode 2: False completion

The AI declared the work done before it actually worked.

Diagnostic question: did the AI verify the change end-to-end, or did it stop when the file write succeeded?

This is the one that makes you furious. The AI says "Done!" with a checkmark. You go to test it. Nothing happens. The change exists. The file is updated. The build runs. But the user-visible behavior didn't change, either because the test never actually ran, or because the test that ran was checking something other than what you cared about, or because the AI never checked at all.

I shipped one of these myself, on the page you're probably looking at parts of right now. Four architectural changes in one commit, two gates passed (markup intact, performance measured), and one important gate missing: nobody had actually tried to scroll the page. When someone did, they couldn't. The mouse wheel was dead. I wrote "complete" in the commit message anyway. The arc is in the-gate-that-caught-us, and the methodology that came out of it is the three-leg gate (structure, then function, then performance), where the functional leg gates the performance leg. Pretty numbers don't ship until something has verified the page still works.

The fix is to refuse to accept "done" as evidence. Make completion mean "I ran the thing, I saw the new behavior, here's the proof." If the AI can't provide proof, the work isn't finished. It's just written.
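
The ordering rule in the three-leg gate can be sketched in a few lines of Python. This is an illustration of the idea, not the gate used on this site: `run_gates` and the lambda checks are hypothetical stand-ins, and a real gate would call actual structure, function, and performance checks. The point it shows is that a failed leg stops everything after it.

```python
from typing import Callable

Gate = tuple[str, Callable[[], bool]]


def run_gates(gates: list[Gate]) -> tuple[bool, list[str]]:
    """Run gates in order; stop at the first failure so later legs never run."""
    log = []
    for name, check in gates:
        passed = check()
        log.append(f"{name}: {'pass' if passed else 'FAIL'}")
        if not passed:
            return False, log
    return True, log


# Placeholder checks: the functional leg fails, so performance is never measured.
done, log = run_gates([
    ("structure", lambda: True),    # e.g. markup intact
    ("function", lambda: False),    # e.g. the page actually scrolls
    ("performance", lambda: True),  # not reached when function fails
])
```

Here `done` is false and the log ends at the functional leg, which is the proof trail "done" should come with.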

Failure mode 3: Stale state

The change exists in the file but isn't running.

Diagnostic question: where exactly is the change you're testing: in the file, in the build artifact, or in the live deployed bundle?

This one is sneaky because everything looks correct. The code in your editor shows the new version. Git diff shows the change. The AI confirms the edit happened. But the running app shows the old behavior, because what's running isn't what's in the file. The culprit might be a build cache, deployment lag, a skipped hot reload, or a framework-specific compilation step that didn't trigger. There's a gap between "I changed this" and "the system reflects the change."

I hit this during a promotion cycle on this very site. Stage one of the gate ran fine. The diff said the change was committed. But the live server was still serving the old build, because the build artifact had already been compiled and the dev process didn't restart. We caught it via measurement, not via the gate's grep checks, which is why surface signals (greps, file existence, build output) need to be paired with semantic signals (does the actual rendered page show what we expected). The principle is laid out in two-bugs-one-symptom, and it's a different shape of the same lesson: the diagnostic question is always "what am I actually looking at."

The fix is to never trust intermediate evidence when the question is "is this live." Hit the rendered URL. Read the response. Verify the new behavior is in the actual output a user would see.
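
A minimal sketch of that check in Python, assuming the change leaves some marker string in the served HTML (a build hash, a new class name, a piece of copy). The function names and the `data-build` attribute in the usage note are hypothetical. If the page renders its content client-side, the raw response may not contain the marker and a headless browser would be needed instead.

```python
import urllib.request


def fetch_body(url: str, timeout: float = 10.0) -> str:
    """Fetch what the server actually returns for `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def is_live(body: str, marker: str) -> bool:
    """True if the served response contains the marker the change introduced."""
    return marker in body
```

Usage would look like `is_live(fetch_body(url), 'data-build="abc123"')`, run against the URL a user would hit rather than against the file on disk.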

Failure mode 4: Untested change

The AI changed something it claimed was unrelated, and it broke something else.

Diagnostic question: did existing tests run after the change?

This one comes from agents getting overconfident about scope. They were told to fix X. They noticed Y looked weird while they were in there. They fixed Y too. They mentioned it in passing in the response. Y was actually load-bearing in a way they didn't catch. Now X is fixed and Y is broken, and your test for the X feature passes, and you have no idea Y exists until something downstream of Y starts failing for what looks like an unrelated reason.

The shape repeats in build work all the time. An agent removes something it labels as "duplicate code" or "unused import" or "redundant check," and the thing was load-bearing exactly because it didn't look like it was doing anything. The mechanical defense against this is to make agents change less per pass, and to run the existing test suite after every change instead of only after the planned changes.

The fix is scope discipline. One thing per pass. Tests for everything that's still supposed to work. The four-layer enforcement framework I've been building is the structural version of this: model prompt, brand-voice skill, pre-commit hook, three-leg gate, each catching what the layer above missed. The full anatomy lives at four-layer-enforcement-framework.
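
"One thing per pass" can be enforced mechanically. The Python sketch below is an assumption-laden illustration, not part of the four-layer framework: it takes the list of files a pass touched (for example from `git diff --name-only`) and flags anything outside the scope you declared up front. The function name and the paths in the usage note are hypothetical.

```python
from fnmatch import fnmatch


def out_of_scope(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed files that match none of the allowed glob patterns."""
    return [
        path for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]
```

If the task was "fix the nav" and the allowed pattern is `src/nav/*`, a touched `src/footer/links.js` comes back in the result, and that is the moment to stop and ask why Y changed during a fix for X. Note that `fnmatch` lets `*` match across path separators, so `src/nav/*` also covers nested folders.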

Failure mode 5: Deployment mismatch

The code works locally and breaks in production.

Diagnostic question: when was the last time the actual deployed version ran successfully end-to-end?

This is the meanest one because the AI is right that the code works. It does work, in the environment the AI tested. Production is a different environment. Different env vars. Different Node versions. Different network conditions. Different data shapes from the real database versus your fixture file. The code is correct relative to your dev setup and wrong relative to where it's actually going to run.

I'd put this one last because it shows up last, usually after the first four are sorted out and you're about to ship. The fix is the boring one. Test in something that looks like production before you call it done. A preview deployment, a staging environment, a Docker container with the prod env vars wired in. If your AI agent's workflow doesn't include "run this in production-shaped conditions," the agent will tell you it works and be honestly wrong.
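
Env vars are the easiest of those differences to check up front. A small Python sketch, assuming you keep a list of the variables production requires; the function name and the variable names in the usage note are placeholders, not anything from a real project.

```python
import os
from collections.abc import Mapping


def missing_env_vars(
    required: list[str],
    environ: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return required variable names that are unset or empty in `environ`."""
    return [name for name in required if not environ.get(name)]
```

Calling `missing_env_vars(["DATABASE_URL", "API_KEY"])` inside the preview or staging environment, and failing the deploy if the result is non-empty, catches one class of "works locally" before a user does. It does not cover runtime versions, network conditions, or data shapes.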

What to do once you know which one

The diagnosis is the leverage. Once you know which failure mode you're staring at, the action gets small fast:

Missing context: make the AI read the file. If it won't, do the reading and paste the file in.
False completion: refuse "done" without proof of running behavior. Define what "done" means before the work starts.
Stale state: hit the rendered URL. Verify what's actually live, not what's in the file.
Untested change: smaller diffs. Run the full test suite after every change.
Deployment mismatch: preview deployment or production-shaped local environment before you call it shipped.

The trap that turns this from "annoying afternoon" into "broken project" is treating every failure like it needs a clever new approach. Most of them don't. They need you to ask the diagnostic question and apply the small fix. The reason AI-built projects get into the rebuild-or-abandon spiral is usually that nobody slowed down enough to figure out which failure mode they were chasing, so they kept applying fixes for the wrong one. Three rounds of that and the codebase has accumulated enough new damage on top of the original problem that "fix it" stops being viable.

The boundary, then, is this: if you can name the failure mode in one sentence, you can probably fix it in an afternoon. If you can't, the project doesn't need more AI. It needs someone to look at the whole thing and figure out where the real problem is before more code gets written.

If you're staring at AI-generated code that doesn't work and the diagnosis isn't obvious from the five above, I can help. Send the repo state, the failure description, what the AI said it did versus what's actually happening, and the deployment context. VibeKoded can scope a rescue diagnostic, stabilization sprint, or rebuild plan.