How to measure AI automation performance


What gets measured improves. What doesn't get measured drifts. AI automation is especially prone to drift because the system can run successfully while producing degraded output, so measuring "did it run" is insufficient. You need to measure whether it ran well, which is a different question, and one that most AI automation dashboards don't answer.

I want to walk through the five metrics that actually matter for AI automation performance, then the metrics that look meaningful but aren't, then the instrumentation pattern that makes the right ones visible. The goal is to give you a dashboard that tells you when the system is healthy and tells you specifically what's wrong when it isn't.

Metric one: task completion rate

How often the automation completes a task it was given, without escalating to human review or failing entirely. This is the headline metric for "is the automation actually automating."

The trap is that "completion" usually defaults to "the function returned without erroring." That's too generous. Real completion means the task got done correctly: output passed validation, downstream system accepted the output, the work the automation was supposed to do is actually done.

Track three numbers: tasks attempted, tasks completed (passed all validation), tasks escalated (routed to human review). The completion rate is tasks completed divided by tasks attempted. The escalation rate is tasks escalated divided by tasks attempted. The failure rate is whatever's left.
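A minimal sketch of that arithmetic, assuming each task ends with exactly one outcome label. The label names and the `outcome_rates` function are hypothetical, chosen only for illustration:

```python
from collections import Counter

# Hypothetical outcome labels; adapt them to whatever your pipeline records.
COMPLETED = "completed"   # passed all validation
ESCALATED = "escalated"   # routed to human review
FAILED = "failed"         # everything else


def outcome_rates(outcomes):
    """Return completion, escalation, and failure rates for a list of outcome labels."""
    attempted = len(outcomes)
    if attempted == 0:
        return {"attempted": 0, "completion": 0.0, "escalation": 0.0, "failure": 0.0}
    counts = Counter(outcomes)
    completion = counts[COMPLETED] / attempted
    escalation = counts[ESCALATED] / attempted
    return {
        "attempted": attempted,
        "completion": completion,
        "escalation": escalation,
        # The failure rate is whatever is left over.
        "failure": 1.0 - completion - escalation,
    }


sample = [COMPLETED] * 96 + [ESCALATED] * 3 + [FAILED]
rates = outcome_rates(sample)
print(rates)
```

Deriving the failure rate as the remainder means any outcome that is neither completed nor escalated, including labels nobody anticipated, still gets counted as a failure.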

Healthy AI automation usually shows completion rates in the high 90s and low single-digit escalation rates. Anything dramatically different is a signal that the system isn't doing what it was designed to do.

Metric two: output quality score

For automations where the output goes to a human consumer (drafts, summaries, classifications, responses), quality matters as much as completion. A task can complete successfully while producing output that's technically valid but qualitatively poor.

Quality measurement is harder than completion measurement because it requires judgment. The practical pattern: spot-check a small percentage of outputs (5-10% is usually enough) against a quality rubric. The rubric can be checked by another AI specifically prompted for evaluation, by a human reviewer on a rotating sample, or by both for important workflows.

Track the rubric pass rate over time. The absolute value matters less than the trend. A stable pass rate (even if not perfect) means the system is performing predictably. A declining pass rate is the early warning that something has drifted, possibly silently.
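The spot-check and trend steps could look roughly like this. It is a sketch under assumptions: the function names, the 5% default, the four-period baseline window, and the 3-point tolerance are all illustrative choices, and the rubric scoring itself (human or AI evaluator) is assumed to happen elsewhere and return True or False per output:

```python
import random
from statistics import mean


def sample_for_review(task_ids, fraction=0.05, seed=None):
    """Pick a random subset of task IDs to score against the rubric."""
    ids = list(task_ids)
    if not ids:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(ids) * fraction))
    return rng.sample(ids, k)


def pass_rate(rubric_results):
    """Fraction of spot-checked outputs that passed the rubric (True = pass)."""
    if not rubric_results:
        return None
    return sum(rubric_results) / len(rubric_results)


def is_declining(pass_rates, window=4, tolerance=0.03):
    """Flag when the latest pass rate is below the recent baseline by more than tolerance."""
    if len(pass_rates) <= window:
        return False  # not enough history to form a baseline
    baseline = mean(pass_rates[-window - 1:-1])
    return pass_rates[-1] < baseline - tolerance
```

Comparing against a rolling baseline rather than a fixed target reflects the point above: the trend matters more than the absolute value.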

Metric three: latency distribution

How long the automation takes to complete each task. The headline number is usually average latency, but average is misleading for AI automation because the distribution is often bimodal: fast cases and slow cases that drag the average around.

Track p50 (median), p95 (95th percentile), and p99. The p50 tells you the typical experience. The p95 tells you what slower-than-typical looks like. The p99 tells you about the tail (cases that are dramatically slower than the rest, often because they hit rate limits, retries, or complex processing paths).

Healthy latency distribution means p50 is acceptable, p95 is within a reasonable multiple of p50, and p99 doesn't blow up the user experience. When p99 starts climbing without p50 moving, you have a tail-latency problem that's worth investigating.
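To make the percentile view concrete, here is a small sketch using the nearest-rank method (one of several common percentile definitions; the function names are hypothetical). The sample data is invented to show a bimodal case where the mean describes neither group:

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already-sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_summary(latencies_ms):
    """Return mean, p50, p95, and p99 for a non-empty list of latencies."""
    values = sorted(latencies_ms)
    summary = {"mean": sum(values) / len(values)}
    for pct in (50, 95, 99):
        summary[f"p{pct}"] = percentile(values, pct)
    return summary


# Invented data: 90 fast tasks at 100 ms, 10 slow tasks at 2000 ms.
latencies = [100] * 90 + [2000] * 10
print(latency_summary(latencies))
```

With this data the mean lands at 290 ms, a latency that no task actually experienced, while p50 and p95 show the fast and slow groups separately.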

Metric four: cost per task

How much running the automation costs, broken down per task. This is the metric that determines whether the automation is economically worth doing.

The honest version of this metric includes all the cost layers: AI vendor costs, infrastructure costs, supervision costs, integration costs. Not just the per-call price. The actual operational cost per task. Track it over time, because cost can drift upward as workflows get more sophisticated, prompts get longer, retry rates increase, or rate limits trigger more often.

The threshold that matters: cost per task versus the value the task produces. If a task produces value worth dollars and costs cents to run, the automation is healthy. If the costs creep toward the value, the automation is becoming uneconomical. Tracking this metric gives you the data to make that call.
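As a rough sketch of the bookkeeping, assuming costs are totalled per reporting period and divided across completed tasks (dividing by attempted tasks is an equally defensible choice). The class, field names, and figures are illustrative, not measurements:

```python
from dataclasses import dataclass


@dataclass
class PeriodCosts:
    """Costs for one reporting period, all in the same currency."""
    vendor: float          # AI vendor / model API spend
    infrastructure: float  # hosting, queues, storage
    supervision: float     # human review time, priced
    integration: float     # connector maintenance, amortized
    tasks_completed: int

    def cost_per_task(self):
        total = self.vendor + self.infrastructure + self.supervision + self.integration
        if self.tasks_completed == 0:
            return float("inf")  # spend with no completed work
        return total / self.tasks_completed


def cost_to_value_ratio(costs, value_per_task):
    """Closer to 0 is healthy; approaching 1 means cost is catching up with value."""
    return costs.cost_per_task() / value_per_task


# Invented figures for illustration only.
month = PeriodCosts(vendor=400.0, infrastructure=100.0,
                    supervision=450.0, integration=50.0,
                    tasks_completed=10_000)
print(month.cost_per_task(), cost_to_value_ratio(month, value_per_task=2.0))
```

Recomputing the ratio each period is what turns "costs creep toward the value" into a visible trend rather than a surprise.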

Metric five: drift indicators

This is the metric class that catches problems other metrics miss. Drift indicators measure properties that should be stable over time, and alert when they change.

Examples:

Average output length. Sudden changes suggest prompt drift or a change in model behavior.

Distribution of output categories. A classifier that suddenly returns "uncertain" 30% of the time when it used to return it 5% of the time is a signal.

Vocabulary or phrasing patterns in generated text. Significant shifts can indicate a model version change.

Correlation between input characteristics and output decisions. Decision boundaries shifting unexpectedly is a model behavior signal.

These aren't intuitive metrics. They're statistical properties of the system that should be stable when nothing has changed. When they shift without anyone deploying a code change, the AI itself has likely changed underneath you, and the drift indicators are often the only signal that will tell you.
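One way to turn the output-category example into a check is to compare the current category shares against a baseline window. The sketch below uses total variation distance, which is my choice for illustration rather than anything this article prescribes; the function names and the 0.1 threshold are likewise assumptions to tune for your own workflow:

```python
from collections import Counter


def category_shares(labels):
    """Map each category to its share of the total; empty input gives an empty dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def total_variation(baseline, current):
    """Half the summed absolute difference in shares: 0 = identical, 1 = disjoint."""
    labels = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(l, 0.0) - current.get(l, 0.0)) for l in labels)


def drifted(baseline_labels, current_labels, threshold=0.1):
    """True when the category mix has moved more than the threshold."""
    distance = total_variation(category_shares(baseline_labels),
                               category_shares(current_labels))
    return distance > threshold


# Invented data: "uncertain" rises from 5% to 30% of outputs.
last_month = ["approve"] * 70 + ["reject"] * 25 + ["uncertain"] * 5
this_week = ["approve"] * 50 + ["reject"] * 20 + ["uncertain"] * 30
print(drifted(last_month, this_week))
```

The same compare-to-baseline shape works for output length or phrasing statistics; only the summary being compared changes.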

Metrics that look meaningful but aren't

A few metrics show up in AI automation dashboards that don't actually tell you what you need to know:

Number of tasks processed. A system can process millions of tasks while doing them all wrong. Volume without quality is noise.

Number of API calls. Tells you about vendor cost but not about whether the work got done.

Uptime percentage. Tells you the service was reachable, not that it was producing correct output.

Average response time. Misleading because the latency distribution is often bimodal. Use percentiles instead.

Number of errors. Counts only explicit failures, which is the smallest category of actual failures. Most AI automation problems don't produce errors.

These metrics aren't useless; they're just not sufficient. They tell you something narrow while leaving the most important questions unanswered.

Instrumentation pattern

The pattern that surfaces the right metrics: instrument at the boundaries, not in the middle.

Every input boundary records what entered. Every output boundary records what left. Every step boundary records whether it passed or failed validation. The cumulative record of boundary events lets you compute completion rate, latency distribution, escalation rate, and drift indicators without having to instrument the internal logic.

Boundary instrumentation is also less brittle. The internal logic can change. The validators can update. The prompts can iterate. The boundaries stay stable because they're defined by the schemas, not by the implementation.
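A minimal in-memory sketch of boundary instrumentation follows. Everything here is hypothetical: the `BoundaryEvent` and `BoundaryLog` names, the boundary labels ("input", "step:...", "output"), and the assumption that each task records at most one input and one output event. A production version would write to a durable event store rather than a list:

```python
import time
from dataclasses import dataclass, field


@dataclass
class BoundaryEvent:
    task_id: str
    boundary: str   # "input", "output", or "step:<name>"
    passed: bool    # did validation at this boundary pass?
    timestamp: float = field(default_factory=time.time)


class BoundaryLog:
    """Collects boundary events and derives metrics without touching internal logic."""

    def __init__(self):
        self.events = []

    def record(self, task_id, boundary, passed=True, timestamp=None):
        if timestamp is None:
            event = BoundaryEvent(task_id, boundary, passed)
        else:
            event = BoundaryEvent(task_id, boundary, passed, timestamp)
        self.events.append(event)
        return event

    def _for_task(self, task_id):
        return [e for e in self.events if e.task_id == task_id]

    def latency(self, task_id):
        """Seconds between input and output events, or None if either is missing."""
        times = {e.boundary: e.timestamp for e in self._for_task(task_id)}
        if "input" in times and "output" in times:
            return times["output"] - times["input"]
        return None

    def completed(self, task_id):
        """Completed means the output boundary was reached and every check passed."""
        events = self._for_task(task_id)
        reached_output = any(e.boundary == "output" for e in events)
        return reached_output and all(e.passed for e in events)
```

Because the log only knows about boundaries, swapping a prompt or validator inside a step changes nothing here as long as the step still reports pass or fail at its edge.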

The dashboard pattern: a single page showing the five metrics above plus their trend over the last week, month, and quarter. The page should answer "is the system healthy right now?" at a glance, and "where's the problem if there is one?" with one drill-down. Most AI automation dashboards I see show twenty metrics that don't compose into either answer. Five metrics that do compose are more valuable than twenty that don't.

How performance measurement connects to reliability

The metrics above don't fix reliability problems. They surface them. The fix is the reliability design work: validators at boundaries, structured outputs, explicit error handling, observability by construction.

The metrics tell you when the reliability work is needed. The reliability work prevents the metrics from going bad. Together they're a closed loop: design for reliability, measure to detect drift, intervene when measurements signal degradation, refine the design.

Without the measurements, you don't know when to intervene. Without the design, the interventions are firefighting forever. Both are required.

Got AI automation that's running without clear performance metrics, or metrics that don't tell you whether it's healthy? Send the current dashboard, the workflows it covers, and the questions you want it to answer. VibeKoded can scope the workflow, prototype the automation, or ship the production version.