Why Your Agents Get Marked Failed When They Succeeded

Up to a third of agent runs get marked as failed while the work completes fine. Here's why run-level and task-state diverge, and the config that stops it.

AI OperationsAgent CompaniesGuide

Kimmo Nurmisto

Founder, Grolea · 6 min read

You open the dashboard, and a third of the fleet is red. Tasks marked failed, heartbeat runs throwing errors, and a couple of agents are apparently stuck in error status. Then you go read the actual work, and it is fine. The article got written. The PR went up. The task was moved to done. The run that did all of it is still flagged as failed.

I keep hitting this while running multi-agent fleets, so let me say it plainly: a lot of those failures never happened. In one 7-agent fleet documented in Paperclip PR #8077, between 23% and 35% of heartbeat runs were marked failed while the underlying work completed successfully. That is not a fleet breaking down a third of the time. It is a measurement that is wrong a third of the time.

One operator on that thread put it the way everyone running a fleet eventually does:

"This fixes a real, ongoing problem for us. +1 to merging"

The gap matters because you act on the number. A 30% failure rate sends you debugging agents that work, throttling throughput that is fine, and slowly losing trust in a fleet that is doing its job. So before you go chasing ghosts, here is what is actually happening and the config that closes the gap.

The red dashboard problem

The symptom shows up in three places that look unrelated until you line them up.

A task gets marked failed even though its output is sitting right there, complete. A heartbeat run reports an error on an agent that, as far as the work goes, did everything you asked. An agent shows as stuck in error status while the thing it was supposed to produce already exists downstream.

Operators search for these one at a time. "Agent marked failed." "Heartbeat run failed." "Agent stuck in error status." They feel like three different bugs. They are usually one, and it is not a bug in your agents. The fleet is deciding what "failed" means using the wrong signal.

What is actually happening: run-level versus task-state divergence

Two different things are being measured, and most setups only watch one of them.

The run is the execution window. In a heartbeat system, that is one wake-up: the process starts, does some work, and exits. The task state is the thing the work was for: did the issue advance, did the PR open, did the record move to done?

Most failure detection reads the run. Did the process exit cleanly? Did it time out? Did the heartbeat throw on the way out? Useful signals, but none of them asks the only question that decides whether real work happened: did the task reach the state it was supposed to reach?

When those two answers disagree, you get a false failure. The common shapes:

The task moves to done, then the run times out a few seconds later during cleanup. Task state says done. Run says failed. The dashboard shows the run.
The work commits successfully, and the heartbeat throws while writing its final status. The output is real; the run is red.
The agent finishes early and exits in a way that the harness reads as a non-zero error rather than a clean stop.

In that operator's failure-rate table on PR #8077, the striking part was not the headline number. It was that the false failures were spread across the fleet rather than concentrated in one flaky agent. The rate was comparable across all seven, from about 23% on one role to 35% on another. Seven independent agents do not break in the same way at the same rate by coincidence. That spread is the signature of a measurement artifact, not a flaky fleet.

The config that closes the gap

The fix is to reconcile the two signals before you report either one. Concretely:

Define success by the task-state transition, not the run exit. If the task reached its target state, the run succeeded, regardless of how the process happened to terminate.

Demote run-level errors that occur after the task has already advanced. A timeout or a throw on cleanup, once the work has landed, is a warning worth logging, not a failure worth alerting on.

Add a reconciliation pass before anything gets marked failed. When a run looks like it errored, check the state of the task it owned. If that task hit its target, close the run as succeeded and move on. Only when the task does not advance, and there is no live continuation, do you have a genuine failure on your hands.

That last distinction is the one that also cleans up your "stuck" count. A truly stuck agent is one whose task never advances, and that has no run still working on it. Everything else flagged as stuck is noise from the same divergence.

What to audit in your own fleet

You do not need the same patch to find out whether you have the same problem. Run this:

Pull your current failure rate, then open ten runs marked failed and read what the agent actually produced. Count how many have completed the correct work attached. That ratio is your false-failure rate.
Check what your failure signal reads. If it keys off process exit or heartbeat status and never looks at task state, you are measuring the run, not the work.
Look at the timing. False failures cluster in the seconds after a task moves to done. If your errors bunch up right after completion, that is the cleanup-window artifact.
Separate "stuck" from "noisy." For every agent showing stuck, confirm whether its task has advanced. The ones that advanced are mislabeled.

If most of your red turns out to be green underneath, fix the measurement before you touch the agents. Reliability you can trust starts with a number you can trust.

This is one half of a pair. The other is what happens when a task genuinely does drop: detection, reassignment, and the escalation path that decides who picks it up. False failures and dropped tasks are the same infrastructure conversation read from two ends, and I am writing the dropped-task half next. The single-agent version of that recovery, bounded retries, checkpoints, and escalation when one agent stalls, is what to do when your AI agent gets stuck.

If you are standing up a fleet and want the org-level scaffolding around this, I wrote up how the roles, budgets, and recovery fit together in how to set up an AI agent company.

I publish what I learn running these fleets in production in The Rewrite. If you operate one too, follow along.

The red dashboard problem

What is actually happening: run-level versus task-state divergence

The config that closes the gap

What to audit in your own fleet

How to Set Up a Multi-Agent Team

What to Do When Your AI Agent Gets Stuck