Who has time for babysitting AI?


The pitch was that AI would free up your team's time. The reality is that your team now spends an hour a day reviewing AI output, catching the mistakes, correcting the wrong assumptions, retrying the failed runs, and patching the integration that just broke for the third time this week. The automation that was supposed to give time back is now the thing consuming it.

This isn't an indictment of AI. It's a design problem. AI systems built without specific discipline against the failure modes end up requiring constant human supervision to produce acceptable output. The same systems built with the discipline run autonomously and produce results you trust without watching them. The difference between the two outcomes isn't the AI. It's what's wrapped around the AI.

I want to walk through the four ways AI work becomes supervision, then the pattern that eliminates each one. The goal is to get you back to the original promise: automation that runs without you watching it.

Babysitting one: every output gets manually reviewed

The AI produces a draft. You can't ship the draft without checking it. So you read every output, decide whether it's acceptable, fix the parts that aren't. The "automation" is producing first drafts that need human judgment before they can be used.

This is the most common shape of AI babysitting. It looks like automation because the AI is doing work. It feels like a job because you're still in the loop on every output.

The fix is to constrain the output space until "acceptable" is something the system can verify automatically. Structured outputs that match a schema. Validators that check the output against rules before it propagates. Rejection of outputs that don't pass the validator with automatic retry against a tightened prompt. When the verification is mechanical, you stop being part of the loop except when the system flags a genuine edge case.

This shifts your work from "review every output" to "review the small set of outputs the system couldn't verify." Review time then scales with the number of exceptions rather than with total volume, and that is where the savings come from.
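Here is a minimal sketch of that validate-and-retry loop, using only the Python standard library. The field names, the allowed categories, and the `call_model` callable are hypothetical stand-ins for whatever your workflow actually uses; the point is the shape, not the specifics.

```python
import json
from typing import Callable, Optional

# Hypothetical schema for illustration: required field names and their types.
REQUIRED_FIELDS = {"category": str, "summary": str, "priority": int}
ALLOWED_CATEGORIES = {"billing", "bug", "feature"}


def validate(raw: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field '{name}'")
        elif not isinstance(data[name], expected):
            errors.append(f"field '{name}' must be {expected.__name__}")
    category = data.get("category")
    if isinstance(category, str) and category not in ALLOWED_CATEGORIES:
        errors.append(f"category must be one of {sorted(ALLOWED_CATEGORIES)}")
    return errors


def generate_verified(
    prompt: str,
    call_model: Callable[[str], str],  # assumed: takes a prompt, returns raw text
    max_attempts: int = 3,
) -> Optional[dict]:
    """Retry with a tightened prompt until the output validates, else return None."""
    current = prompt
    for _ in range(max_attempts):
        raw = call_model(current)
        errors = validate(raw)
        if not errors:
            return json.loads(raw)
        # Tighten the prompt by feeding the specific violations back in.
        current = prompt + "\n\nYour previous answer was rejected:\n- " + "\n- ".join(errors)
    return None  # the caller flags this item for human review
```

A `None` result is the only thing that reaches a person. Everything else either passed the rules or was retried until it did.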

Babysitting two: every edge case requires intervention

The AI handles 90% of cases fine. The other 10% are edge cases the prompt didn't anticipate. Each edge case requires you to step in, decide what to do, and either handle it manually or update the prompt. Over time, the edge cases stack up and the maintenance burden grows even though the system was supposed to be running itself.

The fix is to define the edge cases as part of the system, not as exceptions to it. Either the system handles each known edge case explicitly (a branch in the workflow, a separate prompt, a fallback path), or it explicitly routes unknown edge cases to human review without requiring you to discover them yourself.

The discipline is that "we'll handle it when it comes up" is not a working strategy. Every edge case that requires intervention is a hidden cost. Naming them and either automating them or routing them deliberately is what makes the system actually run.
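One way to make that concrete is a small router: known edge cases get a named rule and a handler, and anything that matches no rule lands in a review queue with a reason attached. This is a sketch under my own assumptions. The rule names, the message fields (`body`, `lang`), and the return labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

Rule = tuple[str, Callable[[dict], bool], Callable[[dict], str]]


@dataclass
class Router:
    """Dispatch each item to a named handler, or queue it for human review."""
    rules: list[Rule] = field(default_factory=list)
    review_queue: list[dict] = field(default_factory=list)

    def register(self, name: str, matches: Callable[[dict], bool],
                 handler: Callable[[dict], str]) -> None:
        # Rules are checked in registration order; the first match wins.
        self.rules.append((name, matches, handler))

    def route(self, item: dict) -> str:
        for _name, matches, handler in self.rules:
            if matches(item):
                return handler(item)
        # Unknown case: record it with context instead of guessing.
        self.review_queue.append({"item": item, "reason": "no rule matched"})
        return "needs_review"


def build_router() -> Router:
    """Example wiring with two hypothetical rules."""
    router = Router()
    router.register("empty_body",
                    lambda m: not m.get("body", "").strip(),
                    lambda m: "skipped_empty")
    router.register("english",
                    lambda m: m.get("lang") == "en",
                    lambda m: "sent_to_main_prompt")
    return router
```

When the same kind of item keeps showing up in `review_queue`, that is the signal to promote it to a registered rule.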

Babysitting three: prompts drift and require correction

The prompt that worked last month produces slightly different output this month. The model was updated, or the input distribution shifted, or the sampling behavior at the same temperature changed. The drift isn't dramatic enough to break anything obviously, but it's enough that you're constantly tweaking the prompt to keep the output where you want it.

The fix is prompt versioning plus regression testing. Every prompt has a version. Every version has a test suite (a small set of representative inputs with expected outputs). When the test suite fails, you know the prompt drifted before the production output suffers. When you update the prompt, you know it passes the regression suite before deploying.

This sounds heavy. In practice it's a few test cases per prompt and a scheduled run. The cost is small. The payoff is not having to fight prompt drift by hand.
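A rough sketch of what "a few test cases per prompt" can look like. One assumption worth flagging: instead of comparing against an exact expected string, each case carries a `check` function, because exact-match comparisons tend to be brittle for model output. `PromptVersion`, `Case`, and `call_model` are illustrative names, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str
    template: str  # filled in with str.format placeholders


@dataclass(frozen=True)
class Case:
    inputs: dict                  # values for the template placeholders
    check: Callable[[str], bool]  # returns True when the output is acceptable
    label: str


def run_regression(
    prompt: PromptVersion,
    cases: list[Case],
    call_model: Callable[[str], str],  # assumed: takes a prompt, returns raw text
) -> list[str]:
    """Return labels of failing cases; an empty list means the suite passed."""
    failures = []
    for case in cases:
        output = call_model(prompt.template.format(**case.inputs))
        if not case.check(output):
            failures.append(f"{prompt.name}@{prompt.version}: {case.label}")
    return failures
```

Run it on a schedule against the live model to catch drift, and run it again before deploying any edited prompt version.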

Babysitting four: integrations keep breaking

The AI workflow touches several external systems. Each one occasionally changes its API, returns unexpected errors, or behaves differently under load. Each break requires you to investigate, patch the integration, and verify the workflow runs again.

The fix is the same shape as prompt versioning but for integrations: contract tests that exercise each external dependency the way the workflow does, scheduled runs that catch drift before production does, alerts that surface failures with enough context to fix quickly.

The combined effect of contract testing and explicit failure routing is that integration breaks announce themselves as soon as a check runs, with enough information to fix them, instead of cascading into mystery symptoms downstream.
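A contract check can be very small. The sketch below assumes each dependency can be exercised through a zero-argument `fetch` callable that returns a dict, and that the "contract" is just the fields and types the workflow actually reads. Both are simplifications of mine; a real integration may need richer checks.

```python
from typing import Any, Callable


def check_contract(
    name: str,
    fetch: Callable[[], Any],   # hypothetical: calls the dependency as the workflow does
    required: dict[str, type],  # fields the workflow reads, with expected types
) -> list[str]:
    """Call the dependency and report every contract violation found."""
    try:
        payload = fetch()
    except Exception as exc:  # broad on purpose: any failure is a finding
        return [f"{name}: call failed with {type(exc).__name__}: {exc}"]
    if not isinstance(payload, dict):
        return [f"{name}: expected an object, got {type(payload).__name__}"]
    problems = []
    for key, expected in required.items():
        if key not in payload:
            problems.append(f"{name}: missing field '{key}'")
        elif not isinstance(payload[key], expected):
            problems.append(
                f"{name}: field '{key}' is {type(payload[key]).__name__}, "
                f"expected {expected.__name__}"
            )
    return problems
```

Each message names the dependency and the exact field that changed, which is the "enough context to fix quickly" part. Feed a non-empty result into whatever alerting you already have.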

What babysitting actually costs

The visible cost is the operator time spent supervising. That's measurable. The less visible costs are bigger.

Opportunity cost: the operator time spent on AI supervision isn't time spent on the work the AI was supposed to free them for. The whole point of the automation was leverage. Without leverage, you have a system that consumes attention rather than producing it.

Trust erosion: when AI requires constant supervision, the team starts treating its output as suspect by default. The skepticism propagates. People stop trusting the system to do the work, which means they stop using it for the work that genuinely is automatable. You lose the willingness to lean on it for harder tasks.

Scaling failure: a system that requires supervision can't scale past the operator's capacity. The throughput limit is your time, not the system's. When demand increases, the supervision burden increases proportionally, and the automation breaks at exactly the moment you most need it to work autonomously.

Each of these compounds. A team six months into an "automation" that's really supervision is in a worse position than a team that never built the automation at all, because the supervision is now embedded in workflows that depend on it.

The pattern that eliminates babysitting

The pattern is to design verification into the system from the start, not as an afterthought. Three layers:

Boundary validation. Every input gets validated against a schema. Every output gets validated against a schema. Invalid data fails loudly at the boundary, not silently downstream.

Confidence scoring. The system has an explicit notion of "I'm sure" versus "I'm guessing." High-confidence outputs run autonomously. Low-confidence outputs route to human review explicitly. Most outputs are high-confidence; the human reviews the small set that aren't.

Observability. The system surfaces what it's doing in ways the operator can spot-check on demand without watching continuously. Dashboards, periodic reports, alert thresholds. The operator can verify the system is healthy in minutes a day, not hours.
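The second and third layers fit together naturally, so here is one sketch covering both: a gate that routes on a confidence score and keeps counters you can read on demand. Where the score comes from is left open (validator results, agreement between runs, something else), and the 0.8 threshold is an arbitrary placeholder, not a recommendation.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ConfidenceGate:
    """Route outputs by confidence and keep counters for spot-checking."""
    threshold: float = 0.8  # assumed cutoff; tune it against reviewed samples
    stats: Counter = field(default_factory=Counter)
    review_queue: list = field(default_factory=list)

    def handle(self, output: dict, confidence: float) -> str:
        if confidence >= self.threshold:
            self.stats["auto"] += 1
            return "auto"
        # Low confidence: route explicitly, keeping the score as context.
        self.stats["review"] += 1
        self.review_queue.append({"output": output, "confidence": confidence})
        return "review"

    def review_rate(self) -> float:
        """Share of outputs sent to a human; a number worth alerting on."""
        total = sum(self.stats.values())
        return self.stats["review"] / total if total else 0.0
```

If `review_rate()` starts climbing, something upstream changed, and you find out from a number on a dashboard rather than from reading every output.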

This is the same pattern I use in my own build pipeline. Catch-at-pre-implementation is the discipline that means I don't have to watch every code generation as it happens. The validators catch what would have required my attention. The build pipeline catches what the validators don't. The three-leg gate catches what the pipeline doesn't. By the time something would need my attention, it's been escalated explicitly with full context. The system runs without my watching it; I show up when there's a real decision.

When AI genuinely needs human-in-loop

There are real situations where human review is the right design choice, not a failure of automation. High-stakes decisions where the cost of a wrong output exceeds the cost of review. Creative work where the human's taste is the differentiator. Edge cases where ambiguity is genuine.

The discipline is to distinguish these (real human-in-loop) from the babysitting case (incidental human-in-loop because the system wasn't designed well enough to run autonomously). Real human-in-loop is bounded, planned, and adds value. Babysitting is open-ended, accidental, and subtracts time.

If you're spending an hour a day on AI supervision and you can't name which of those two it is, it's probably babysitting. The diagnostic is to ask whether your judgment is genuinely required for each review, or whether the review is just catching things the system should be catching on its own.
