Why passing tests does not mean the app is fixed


The AI fixed the bug. The tests pass. The AI reported success. You open the app to verify, and it's still broken in exactly the way it was before. The tests pass; the bug remains.

This pattern is one of the most consistent rescue cases and one of the most easily misunderstood. The instinct is to either blame the tests (they're incomplete) or blame the AI (it lied). Both interpretations miss the structural cause. Tests verify what tests are designed to verify, which is necessary but not sufficient for verifying that an app actually works. AI agents report success against tests because that's what they were asked to do; they're not lying about test status, they're just reporting on the wrong dimension.

The fix isn't more tests, exactly. The fix is acceptance gates that include checks tests can't perform. The four-leg pattern below catches the false-pass cases and gives "fixed" a meaning that matches reality rather than test output.

What tests can verify

Tests verify the specific assertions written into the test code. If the test says "calling this function with input X should return output Y," the test verifies that assertion. If Y is returned, the test passes. If something else is returned, the test fails.

This is genuinely useful. Tests catch regressions (something that used to work no longer does), specification violations (the function returns the wrong type, throws the wrong error, has the wrong signature), and obvious logical errors (a calculation that produces wrong arithmetic).

The verification is structural: tests verify what they were written to verify. They don't verify what they weren't written to verify.
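A minimal sketch of that boundary, in Python. The function `apply_discount` and its test are hypothetical, invented here for illustration. The suite is green, and an input nobody wrote an assertion for still produces a negative price.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test."""
    return price - price * percent / 100


def test_apply_discount_happy_path() -> None:
    # The only assertion in the suite: input (100, 10) should return 90.
    assert apply_discount(100, 10) == 90


# The test passes, so the suite reports success.
test_apply_discount_happy_path()

# No test covers a percent above 100, so this result goes unnoticed.
negative_price = apply_discount(100, 150)  # -50.0
```

The test is accurate about what it checked. It says nothing about the input it never saw.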

What tests can't verify

Tests don't verify the behaviors nobody thought to test. If the original test suite focused on the happy path and didn't include the edge case where users with permissions A and B encounter scenario C, the test suite won't fail when scenario C is broken.

Tests don't verify integration behavior beyond what the integration tests explicitly check. Two components that both have passing unit tests can fail when integrated if the integration tests don't cover the specific interaction.

Tests don't verify rendering, layout, or user-facing presentation unless visual regression tests exist. A page can render visibly wrong (overlapping text, missing buttons, broken layout) while all functional tests pass.

Tests don't verify deployment, environment-specific behavior, or runtime configuration. An app can have all unit and integration tests passing in CI and still fail in production due to environment differences.

Tests don't verify data integrity beyond what specific data-validation tests check. Operations can succeed at the function level while corrupting data at the application level.

Tests don't verify that the change being tested actually addressed the original problem. A test added to verify a fix can pass while the user-experienced bug remains, if the test was designed against an incorrect understanding of the bug.

The structural cause of false-pass

The AI's report of "fixed" is based on the AI's available verification, which is usually the test suite plus whatever quick check it ran. The AI is being accurate about test status; the test status just doesn't equal "the app is fixed."

This isn't an AI failure mode specifically. It's a structural property of tests as a verification tool. Tests verify a subset of behavior. The subset they verify is necessarily smaller than the full behavior of the app. The gap is where false-pass cases live.

AI makes this gap more consequential because AI-generated changes can be larger and more frequent than human ones. When humans made every change, the rate of change was naturally bounded; false-pass cases existed, but at low volume. When AI accelerates the rate of change, false-pass cases scale with it.

The four-leg acceptance pattern

The pattern that addresses false-pass: every change goes through four legs of verification, not one.

Leg one: tests pass. This is the leg the AI's report covers. Tests run, tests pass. Necessary baseline.

Leg two: surface check. Verify the visible behavior of the change. Open the app. Trigger the affected flow. Observe what actually happens. Compare to what should happen.

This catches the cases where tests pass but the visible behavior is wrong. A button that's supposed to submit a form might satisfy its test (the click handler runs) while doing nothing visible to the user (the form doesn't actually submit).
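The button case can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: `FakeBackend`, `SignupButton`, and both check functions are invented for illustration. In a real project leg two is a person, or a browser-driving script, looking at the running app, not a function call.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBackend:
    """Hypothetical stand-in for whatever receives the form."""
    received: list[str] = field(default_factory=list)

    def submit(self, email: str) -> None:
        self.received.append(email)


@dataclass
class SignupButton:
    backend: FakeBackend
    clicks: int = 0

    def on_click(self, email: str) -> None:
        self.clicks += 1
        # Bug: the submit call was dropped, so nothing reaches the backend.
        # self.backend.submit(email)


def leg_one_test(button: SignupButton) -> bool:
    # The existing test only asserts that the click handler ran.
    button.on_click("a@example.com")
    return button.clicks == 1


def leg_two_surface_check(button: SignupButton) -> bool:
    # The surface check asks what the user cares about: did the form go out?
    button.on_click("b@example.com")
    return "b@example.com" in button.backend.received


button = SignupButton(backend=FakeBackend())
tests_pass = leg_one_test(button)            # True
surface_ok = leg_two_surface_check(button)   # False
```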

Leg three: semantic check. Verify the underlying state matches what the surface suggests. Check the database. Trace the operation. Inspect the data that was supposed to be created, modified, or deleted.

This catches the surface-vs-semantic mismatches where the visible behavior looks right but the underlying state is wrong. The form looks submitted (the page says "thank you") but the submission didn't actually go through (the database has no record of it).

The pattern is documented in "Three catches at the surface vs semantic boundary" and "The gate that caught us," both of which describe specific cases where this leg caught what tests and surface checks missed.
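Here is a hedged sketch of the "thank you" case using Python's built-in `sqlite3` module and an in-memory database. The `submissions` table, its length constraint, and the `handle_submission` handler are assumptions made up for the example, not anything from a real codebase.

```python
import sqlite3


def handle_submission(conn: sqlite3.Connection, email: str) -> str:
    """Hypothetical handler that returns the confirmation page text."""
    try:
        conn.execute("INSERT INTO submissions (email) VALUES (?)", (email,))
    except sqlite3.Error:
        # Bug: the database error is swallowed and the user still sees success.
        pass
    return "thank you"


def semantic_check(conn: sqlite3.Connection, email: str) -> bool:
    """Leg three: confirm that the stored state matches what the page claimed."""
    row = conn.execute(
        "SELECT COUNT(*) FROM submissions WHERE email = ?", (email,)
    ).fetchone()
    return row[0] == 1


conn = sqlite3.connect(":memory:")
# Assumed schema: the CHECK constraint rejects addresses over 20 characters.
conn.execute(
    "CREATE TABLE submissions (email TEXT NOT NULL CHECK (length(email) <= 20))"
)

email = "a-very-long-address@example.com"
page = handle_submission(conn, email)

surface_ok = page == "thank you"           # True: the page looks right
semantic_ok = semantic_check(conn, email)  # False: no row was stored
```

The surface check alone would have accepted this change. Only the query against the stored state shows the submission never landed.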

Leg four: integration check. Verify the change doesn't break adjacent things that depend on the changed code. Run flows that touch the same components. Check that data the changed code produces is consumed correctly by downstream code.

This catches the regression cases where the change works in isolation but breaks integration with other parts of the system.
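A small Python sketch of a change that works in isolation and breaks its consumer. Both functions are hypothetical: `export_price` stands for the changed code, `invoice_total` for unchanged downstream code that depends on it.

```python
def export_price(cents: int) -> str:
    """Changed code: now returns a display string with a currency symbol."""
    return f"${cents / 100:.2f}"


def invoice_total(price_strings: list[str]) -> float:
    """Unchanged downstream consumer: still expects bare numeric strings."""
    return sum(float(p) for p in price_strings)


def unit_test_passes() -> bool:
    # The test written alongside the change only checks the new format.
    return export_price(1250) == "$12.50"


def integration_check_passes() -> bool:
    # Leg four: feed the changed code's output to the code that consumes it.
    try:
        return invoice_total([export_price(1250), export_price(750)]) == 20.0
    except ValueError:
        # float("$12.50") raises ValueError, so the consumer is broken.
        return False


unit_ok = unit_test_passes()                 # True
integration_ok = integration_check_passes()  # False
```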

When all four legs pass, the change is genuinely fixed. When only some pass, the change is incomplete and needs more work. The acceptance gate is "all four legs pass," not "tests pass."
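As a sketch, the gate is a conjunction over the four legs. The `Leg` type and `run_gate` function below are illustrative names, and the lambdas are placeholders for real checks, some of which will be manual steps whose result a person records.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Leg:
    name: str
    check: Callable[[], bool]


def run_gate(legs: list[Leg]) -> tuple[bool, list[str]]:
    """Accept the change only if every leg passes; report the ones that failed."""
    failed = [leg.name for leg in legs if not leg.check()]
    return (not failed, failed)


# Placeholder results: tests, surface, and integration pass; semantic fails.
legs = [
    Leg("tests", lambda: True),
    Leg("surface", lambda: True),
    Leg("semantic", lambda: False),
    Leg("integration", lambda: True),
]

accepted, failed = run_gate(legs)  # accepted is False, failed is ["semantic"]
```

Returning the failed leg names, not just a boolean, tells you which kind of work remains.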

How to design acceptance gates that include all four legs

The acceptance gate is the checklist (manual, automated, or hybrid) that runs before a change is considered done. To design it well:

Make leg one (tests) the cheapest layer. Tests should run on every change automatically. Pre-commit hooks, CI pipelines, watch-mode test runners. The cost per run should be low enough that running them constantly is sustainable.

Make leg two (surface) part of the change workflow. Don't accept a change without actually opening the app and verifying the visible behavior. This is the leg most often skipped because it requires manual attention. The cost of skipping it is the false-pass case where tests passed but the app is still broken.

Make leg three (semantic) explicit for changes that affect state. When a change modifies data, the acceptance gate should include verifying the data is correct after the change. When a change affects integrations, the gate should include verifying the integration actually works.

Make leg four (integration) part of the test suite where possible. Integration tests are slower than unit tests and tend to be skipped in fast-iteration loops. The skip is fine for cheap changes; for changes that affect surfaces with many consumers, integration tests should be part of the acceptance gate even if they slow down the loop.
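One way to encode those design rules is a function that decides which legs a given change must clear. This is a sketch under stated assumptions: the `Change` fields and the consumer threshold of 3 are invented for illustration, and the right cutoff for "many consumers" depends on your project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical description of a change, filled in by whoever proposes it."""
    modifies_state: bool
    consumer_count: int  # how many other components use the changed surface


def required_legs(change: Change, many_consumers: int = 3) -> list[str]:
    # Tests and the surface check apply to every change.
    legs = ["tests", "surface"]
    # The semantic check is only needed when data or integrations are touched.
    if change.modifies_state:
        legs.append("semantic")
    # Slower integration tests are required once enough consumers are affected.
    if change.consumer_count >= many_consumers:
        legs.append("integration")
    return legs


copy_tweak = required_legs(Change(modifies_state=False, consumer_count=0))
schema_change = required_legs(Change(modifies_state=True, consumer_count=5))
```

Here `copy_tweak` needs only the two cheap legs, while `schema_change` needs all four.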

The codification loop applied to false-pass

Each false-pass case is raw material for improving the acceptance gate. When you discover that tests passed but the app was still broken, the question to ask is: what check would have caught this that wasn't in the gate?

The answer becomes the next codified gate. Specific check for the specific failure pattern, added to the acceptance criteria, runs on every subsequent change. Over time, the gate accumulates checks for the specific false-pass patterns your project has experienced, and the rate of false-pass drops.

This is the same Move 1 → Move 2 codification loop covered in "Why AI keeps changing your code," applied to the acceptance-gate dimension. Each diagnostic becomes inherited prevention.
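A minimal sketch of the loop as a registry that grows by one check per diagnosed false pass. The `codify` decorator, the registry, and both check bodies are hypothetical; real checks would query your own app instead of returning constants.

```python
from typing import Callable

# Registry of codified checks, keyed by the false-pass pattern each one catches.
GATE_CHECKS: dict[str, Callable[[], bool]] = {}


def codify(failure_pattern: str):
    """Register a check under the failure pattern it was written to catch."""
    def register(check: Callable[[], bool]) -> Callable[[], bool]:
        GATE_CHECKS[failure_pattern] = check
        return check
    return register


@codify("thank-you page shown but no submission row stored")
def submission_row_exists() -> bool:
    # Placeholder body; a real check would query the project's database.
    return True


@codify("handler runs but form never submits")
def form_reaches_backend() -> bool:
    # Placeholder body that fails, to show how a failure is reported.
    return False


def failing_checks() -> list[str]:
    """Run every codified check and return the patterns that recurred."""
    return [pattern for pattern, check in GATE_CHECKS.items() if not check()]


recurred = failing_checks()  # ["handler runs but form never submits"]
```

Keying each check by the failure it came from keeps the gate's history readable: every entry answers "what went wrong the last time this was missing?"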

What this looks like in practice

The acceptance gate I run on my own production work has four explicit legs that match the pattern above. Tests run via the codified template. Surface verification happens during the promotion step. Semantic verification runs through specific checks for data-affecting changes. Integration verification runs through the recursive validator that resolves cross-references across the system.

The gate catches false-pass cases that any single leg would miss. The discipline isn't expensive (each leg is bounded) and the prevention is meaningful (most "I thought this was fixed but it isn't" cases don't survive the four-leg gate).

The framework integrates with the broader four-layer enforcement discipline, where the acceptance gate is the third layer (automated and manual gates against AI-generated output before promotion).

What to do if you're currently in false-pass loops

If your AI-assisted work is producing changes that pass tests but don't actually work, the four-leg gate is the path forward. Install it incrementally:

Start with leg two (surface checks). Make manual surface verification non-optional for every change. This is the easiest leg to add and catches most false-pass cases that tests miss.

Add leg three (semantic checks) for changes that affect state. The cost is bounded because not every change needs the semantic check; only the ones that modify data or integrations need it.

Strengthen leg four (integration checks) as integration tests grow. This is the slowest leg to install and the longest-running, but it provides the regression protection that tests and surface checks can't.

Keep leg one (tests) maintained. Don't let test coverage drop because surface checks are catching the gaps. Tests are still the cheapest layer and still necessary.

Over a few weeks of deliberate practice, the four-leg gate becomes habitual. The false-pass rate drops to near-zero. The AI's report of "fixed" becomes accurate against the gate (because the gate is what was actually checked) rather than misleading against your expectations of what "fixed" should mean.

If you're stuck in false-pass loops where tests pass but the app keeps being broken, send the failure pattern, the current acceptance criteria, and the kinds of changes that are slipping through. VibeKoded can scope a rescue diagnostic, stabilization sprint, or rebuild plan.