Self-healing workflows: error handling for autonomous pipelines

A workflow that runs at 2 AM and a workflow that breaks at 2 AM are different products. Patterns for retries, dead-letter queues, idempotency, and LLM-specific failure modes.

02 May 2026 · 10 min read · Krypto Forge

A workflow that runs every day at 2 AM is a different product from one that runs once when you push a button. The first is infrastructure. The second is a tool. The patterns you need for the first didn't make it into most automation tutorials, which is why so many "automated" pipelines silently break and nobody knows for three weeks.

This is the set of patterns we use to make workflows that heal themselves, or at least fail loudly enough that somebody notices.

The five things that go wrong

Cataloguing failure first makes the patterns make sense.

One: a downstream API is temporarily unavailable. Razorpay's webhook is delayed. GSTN's portal is down. Tally is locked because someone left a modal open on the office desktop. These are transient. The right move is retry.

Two: a downstream API is persistently broken. Wrong credentials, expired token, rate limit you've actually hit. Retrying just makes it worse. The right move is back off and escalate.

Three: the input is malformed. A WhatsApp message your parser can't handle. A photo of a chit too blurry to read. A voice note in a language you didn't expect. Retrying won't change the input. The right move is to log it, ask for help, and move on.

Four: a side effect succeeded but the confirmation didn't. You posted an invoice to Tally, the response timed out, you don't know if it took. The right move is to make the call idempotent, so that sending it again is safe.

Five: the LLM itself misbehaves. Wrong output format, refused the request, hallucinated a field. Different kind of failure entirely.

A robust workflow handles all five, distinctly. A fragile workflow treats them all as "an error".
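One way to keep the five cases distinct in code is to classify every exception before deciding what to do with it. The sketch below is illustrative only: the exception classes, the `wrote` flag, and the action names are hypothetical, not part of any library.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    TRANSIENT = "transient"      # one: downstream temporarily unavailable
    PERSISTENT = "persistent"    # two: downstream persistently broken
    BAD_INPUT = "bad_input"      # three: malformed input
    UNCONFIRMED = "unconfirmed"  # four: side effect may or may not have landed
    LLM = "llm"                  # five: the model misbehaved


# The response each kind calls for, following the catalogue above.
ACTIONS = {
    FailureKind.TRANSIENT: "retry_with_backoff",
    FailureKind.PERSISTENT: "back_off_and_escalate",
    FailureKind.BAD_INPUT: "log_and_ask_for_help",
    FailureKind.UNCONFIRMED: "retry_with_idempotency_key",
    FailureKind.LLM: "validate_and_route",
}


class BadInputError(Exception):
    """Hypothetical: raised by a parser that cannot handle the input."""


class LLMOutputError(Exception):
    """Hypothetical: raised when model output is malformed, refused, or invented."""


def classify(exc: Exception, status: Optional[int] = None, wrote: bool = False) -> FailureKind:
    """Map an exception (plus optional HTTP status) to one of the five kinds.

    `wrote` is True when the failing call was an outbound write.
    """
    if isinstance(exc, LLMOutputError):
        return FailureKind.LLM
    if isinstance(exc, BadInputError):
        return FailureKind.BAD_INPUT
    if isinstance(exc, TimeoutError):
        # A timed-out write may still have landed downstream.
        return FailureKind.UNCONFIRMED if wrote else FailureKind.TRANSIENT
    if status is not None and (status == 429 or status >= 500):
        return FailureKind.TRANSIENT
    # Other 4xx responses and anything unrecognised: escalate, don't retry blindly.
    return FailureKind.PERSISTENT
```

Treating unknown errors as persistent is a deliberate choice in this sketch: an unrecognised failure gets a human's attention instead of three silent retries.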

Retries done right

The boring pattern: retry on transient failures with exponential backoff and jitter. Cap the retries (usually three). After the cap, escalate.

Specifically:

  • Backoff base: start at 1-2 seconds.
  • Multiplier: 2x per attempt.
  • Jitter: plus-minus 25%, so concurrent failures don't all retry at the same instant.
  • Cap: 3 retries for most workflows; 5 for ones where the downstream is known-flaky.
  • Retry only on known-transient errors: 5xx, network timeouts, and rate-limit responses (HTTP 429). Other 4xx errors are usually you, not them. Don't retry those.

In n8n, this is the built-in "Retry on Fail" with backoff. In LangGraph or custom code, you write it. Either way, it's a known recipe.
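For the custom-code case, a minimal sketch of the recipe might look like this. `TransientError` is a hypothetical marker exception that the caller raises for 5xx, timeouts, and rate-limit responses; the defaults mirror the numbers in the list above.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for 5xx, network timeouts, and rate-limit responses."""


def backoff_delay(attempt: int, base: float = 1.0, multiplier: float = 2.0,
                  jitter: float = 0.25,
                  rng: Callable[[], float] = random.random) -> float:
    """Delay in seconds before retry number `attempt` (1-based), with jitter."""
    raw = base * multiplier ** (attempt - 1)
    # rng() is in [0, 1); map it onto [-jitter, +jitter).
    return raw * (1 + jitter * (2 * rng() - 1))


def call_with_retry(fn: Callable[[], T], max_retries: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn; retry only on TransientError, at most max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # cap reached: the caller escalates, for example to a DLQ
            sleep(backoff_delay(attempt + 1))
    raise AssertionError("unreachable for max_retries >= 0")
```

Any other exception propagates on the first attempt, which is the "don't retry 4xx" rule expressed in code. Passing `sleep` and `rng` as parameters keeps the function testable without waiting on real clocks.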

Dead-letter queues, for the cases retries can't fix

When retries are exhausted, the item goes to a dead-letter queue (DLQ). This is just a Postgres table with the failed payload, the error, the timestamp, and the workflow step. We've used various queue systems; the table is simpler and works fine at SMB scale.

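-- One row per item that exhausted its retries.
CREATE TABLE dlq_workflow_errors (
  id BIGSERIAL PRIMARY KEY,
  workflow_name TEXT NOT NULL,
  step_name TEXT NOT NULL,           -- the step that failed
  payload JSONB NOT NULL,            -- the failed input, kept so it can be replayed
  error_message TEXT NOT NULL,
  error_class TEXT NOT NULL,         -- used to group similar failures
  attempt_count INT NOT NULL,
  failed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  resolved_at TIMESTAMPTZ,           -- NULL while the row is unresolved
  resolution_note TEXT
);

-- Supports the daily review: unresolved rows per workflow.
CREATE INDEX ON dlq_workflow_errors (workflow_name, resolved_at);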

Every morning, somebody (or an automated digest) reviews the unresolved rows. Some get retried after a manual fix. Some get classified and added to the prompt as a known-failure case. Some get genuinely escalated.

The DLQ is the difference between a workflow you have to babysit and a workflow you can trust to surface its own problems.
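As a rough illustration of the write path and the morning review, here is a sketch that uses Python's built-in `sqlite3` as a stand-in for Postgres so that it runs anywhere. The column names follow the table above, with types simplified for SQLite; the function names are hypothetical.

```python
import json
import sqlite3

# SQLite stand-in for the Postgres table above (assumption: same columns, simpler types).
SCHEMA = """
CREATE TABLE dlq_workflow_errors (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  workflow_name TEXT NOT NULL,
  step_name TEXT NOT NULL,
  payload TEXT NOT NULL,
  error_message TEXT NOT NULL,
  error_class TEXT NOT NULL,
  attempt_count INTEGER NOT NULL,
  failed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
  resolved_at TEXT,
  resolution_note TEXT
)
"""


def send_to_dlq(conn, workflow, step, payload, exc, attempts):
    """Record a failed item after retries are exhausted; returns the new row id."""
    cur = conn.execute(
        "INSERT INTO dlq_workflow_errors "
        "(workflow_name, step_name, payload, error_message, error_class, attempt_count) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (workflow, step, json.dumps(payload), str(exc), type(exc).__name__, attempts),
    )
    conn.commit()
    return cur.lastrowid


def unresolved_summary(conn):
    """Unresolved counts per (workflow, error class), largest first, for the review."""
    return conn.execute(
        "SELECT workflow_name, error_class, COUNT(*) FROM dlq_workflow_errors "
        "WHERE resolved_at IS NULL "
        "GROUP BY workflow_name, error_class "
        "ORDER BY COUNT(*) DESC, workflow_name, error_class"
    ).fetchall()


def resolve(conn, row_id, note):
    """Mark a row resolved and record how it was resolved."""
    conn.execute(
        "UPDATE dlq_workflow_errors "
        "SET resolved_at = CURRENT_TIMESTAMP, resolution_note = ? WHERE id = ?",
        (note, row_id),
    )
    conn.commit()
```

Grouping by `error_class` is what makes the review fast: ten rows with the same class are usually one problem, not ten.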

Idempotency, the unsexy one

Idempotency is the property that doing the same operation twice has the same effect as doing it once. It's the only honest answer to "did the operation succeed". If you can retry safely, you don't have to know.

The pattern: every outbound write carries an idempotency key. The downstream system stores recent keys and returns the prior result on a duplicate.

  • Razorpay supports this natively for payment creation.
  • Tally doesn't, so we wrap it in a layer that checks our own database before pushing.
  • GSTN APIs vary; check per endpoint.
  • For internal database writes, use unique constraints on a deterministic key.

Without idempotency, a flaky network costs you double invoices and duplicate payment links. With it, the worst case is a wasted millisecond.

We've made this mistake. Specifically, early Tally integration without idempotency keys, and a flaky network. Double invoices in the wild for a week before we noticed. Don't.
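For a downstream without native support, the wrapping layer can be sketched roughly as follows. The key derivation and the `IdempotentWriter` class are illustrative assumptions; the in-memory dict stands in for a database table with a unique constraint on the key.

```python
import hashlib
import json
from typing import Any, Callable, Dict


def idempotency_key(operation: str, payload: Dict[str, Any]) -> str:
    """Deterministic key: the same operation and payload always give the same key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{operation}:{canonical}".encode("utf-8")).hexdigest()


class IdempotentWriter:
    """Checks our own store before pushing to a downstream with no idempotency support."""

    def __init__(self, push: Callable[[Dict[str, Any]], Any]):
        self._push = push
        # Assumption: in production this is a table with a unique key, not a dict.
        self._results: Dict[str, Any] = {}

    def write(self, operation: str, payload: Dict[str, Any]) -> Any:
        key = idempotency_key(operation, payload)
        if key in self._results:
            # Duplicate: return the prior result without pushing again.
            return self._results[key]
        result = self._push(payload)
        self._results[key] = result
        return result
```

This sketch records the key only after a successful push, so it covers re-runs and replays. A push that lands but times out before the result is recorded still needs a lookup against the downstream itself before retrying.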

Circuit breakers, for the persistent failures

When a downstream is persistently broken, retrying is worse than not. You're consuming budget, you're consuming time, you might be making the downstream's problem worse.

The circuit-breaker pattern: track failure rate per downstream. When the failure rate over the last N attempts crosses a threshold (we use 50% over 10 attempts), open the breaker. While open, the workflow short-circuits to the DLQ without calling the downstream. After a cool-down (5-15 minutes), try once. If that works, close the breaker. If it fails, keep the breaker open and start another cool-down.

It's a 50-line pattern. It saves you days of debugging when somebody else's API is the problem and you're spending API budget pretending it's not.
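A minimal single-threaded sketch of such a breaker, assuming one instance per downstream. The class and method names are hypothetical, and the defaults use the 50%-over-10 threshold with a 5-minute cool-down.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class CircuitBreaker:
    def __init__(self, window: int = 10, threshold: float = 0.5,
                 cooldown_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._failures: Deque[bool] = deque(maxlen=window)  # True = failed attempt
        self._window = window
        self._threshold = threshold
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    def allow(self) -> bool:
        """True if the downstream may be called now (closed, or cool-down elapsed)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self._cooldown_s

    def record(self, ok: bool) -> None:
        """Report the outcome of a call that allow() permitted."""
        if self._opened_at is not None:
            # This was the single trial call after the cool-down.
            if ok:
                self._opened_at = None
                self._failures.clear()
            else:
                self._opened_at = self._clock()  # stay open, restart the cool-down
            return
        self._failures.append(not ok)
        window_full = len(self._failures) == self._window
        if window_full and sum(self._failures) / self._window >= self._threshold:
            self._opened_at = self._clock()
```

The caller checks `allow()` before each call, routes the item to the DLQ when it returns False, and reports every outcome through `record()`. With concurrent workers, more than one trial call can slip through after the cool-down; a real implementation would guard that with a lock or a half-open flag.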

LLM-specific failures

Different category, separate handling.

Malformed output. Always validate against a schema. On validation failure, retry once with the previous (bad) output included in the prompt and an instruction to fix it. If that fails, route to a human.
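The validate, retry-once-with-fix, then route-to-human flow could be sketched like this. The `REQUIRED` schema, the `NeedsHuman` exception, and the `llm` callable (prompt in, raw text out) are all assumptions for illustration; a real workflow would likely use a proper schema library.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED = {"customer_id": str, "amount": (int, float)}


class NeedsHuman(Exception):
    """Raised when output is still invalid after the single fix attempt."""


def validate(raw: str) -> Dict[str, Any]:
    """Parse JSON and check required fields; raise ValueError on any problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, typ in REQUIRED.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"field {field!r} is missing or has the wrong type")
    return data


def extract(llm: Callable[[str], str], prompt: str) -> Dict[str, Any]:
    """Call the model, validate, and retry exactly once with the bad output attached."""
    raw = llm(prompt)
    try:
        return validate(raw)
    except ValueError as first_error:
        fix_prompt = (
            f"{prompt}\n\nYour previous output was:\n{raw}\n"
            f"It failed validation: {first_error}\n"
            "Return corrected JSON only."
        )
    try:
        return validate(llm(fix_prompt))
    except ValueError as second_error:
        raise NeedsHuman(str(second_error)) from second_error
```

Including the validation error in the fix prompt, not just the bad output, gives the model something specific to correct.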

Refusal. Log it separately. Don't retry on refusal; it almost always means the prompt or the input has something the model is reading as unsafe. Surface it to a human to rephrase or reclassify.

Rate limits. Treated as a transient downstream failure. Retry with backoff. Consider dropping to a smaller model in the meantime if the workflow can tolerate slightly worse output.

Hallucinated fields. Strict output schemas catch some of these. For values that have to exist in the database (a product ID, a customer ID), validate after the call. If the model invented an ID, reject the output and retry with a hint.
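A sketch of the post-call ID check, assuming the model's output has already been parsed into a dict and that IDs are strings. The `known_ids` sets stand in for a database lookup, and all names here are hypothetical.

```python
from typing import Any, Callable, Dict, List, Set, Tuple


class UnknownId(Exception):
    """Raised when the model keeps returning IDs that do not exist."""


def check_ids(output: Dict[str, Any],
              known_ids: Dict[str, Set[str]]) -> List[Tuple[str, Any]]:
    """Return (field, value) pairs whose value is not a known ID."""
    return [(field, output.get(field))
            for field, valid in known_ids.items()
            if output.get(field) not in valid]


def call_with_id_check(llm: Callable[[str], Dict[str, Any]], prompt: str,
                       known_ids: Dict[str, Set[str]],
                       max_hints: int = 1) -> Dict[str, Any]:
    """Reject outputs with invented IDs and retry with a hint naming them."""
    hint = ""
    names = ""
    for _ in range(max_hints + 1):
        output = llm(prompt + hint)
        bad = check_ids(output, known_ids)
        if not bad:
            return output
        names = ", ".join(f"{field}={value!r}" for field, value in bad)
        hint = (f"\n\nThese IDs do not exist: {names}. "
                "Use only IDs from the provided context.")
    raise UnknownId(names)
```

A missing field is treated the same as an invented one, since `None` is never in the set of valid IDs.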

A workflow that doesn't distinguish between "the API is down" and "the model is confused" will be useless to debug at 3 a.m. Always log them differently.

Alerting that respects business hours, mostly

A workflow that pages you for every transient error will get muted within a week, after which it might as well not exist.

Our defaults:

  • Per-incident alerts on critical workflows (anything touching money or compliance) page immediately.
  • Hourly digest for non-critical workflows. A summary of what went into the DLQ in the last hour.
  • Daily digest for everything, with counts and links.
  • Weekly review of patterns in the DLQ. This is where we tighten prompts, add examples, or change tool behaviour.

The daily and weekly cadences are the ones that actually drive product improvement. The page-at-3-a.m. version is for the rare real emergency.
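The routing rule and the digest summary are simple enough to sketch in a few lines. The `Failure` record and the channel names are hypothetical; sending the page or the digest is left to whatever alerting tool is in use.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Failure:
    workflow: str
    error_class: str
    critical: bool  # True for anything touching money or compliance


def route(failure: Failure) -> str:
    """Critical workflows page immediately; the rest wait for the hourly digest."""
    return "page" if failure.critical else "hourly_digest"


def digest_lines(failures: Iterable[Failure]) -> List[str]:
    """One line per (workflow, error class) with a count, most frequent first."""
    counts = Counter((f.workflow, f.error_class) for f in failures)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{workflow} / {error_class}: {n}" for (workflow, error_class), n in ordered]
```

The same `digest_lines` function can serve the hourly, daily, and weekly summaries; only the time window of the failures passed in changes.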

The shape that works

Putting it together, every production workflow we ship has roughly this shape:

  • Every external call wrapped in retry-with-backoff.
  • Every retry exhaustion routes to the DLQ.
  • Every outbound write has an idempotency key.
  • Every downstream has a circuit breaker.
  • Every LLM call has output validation and a single retry-with-fix.
  • Every workflow emits structured logs and metrics.
  • Every workflow has an owner whose name is in the runbook.
  • Every workflow has a daily digest and a weekly review.

That sounds like a lot. It's about 40% more code than the naive happy-path version. It pays for itself within the first month.

The takeaway

Self-healing workflows aren't self-healing because they're magic. They're self-healing because the people who built them respected the failure modes and wrote separate paths for each one.

The single most leveraged investment in a long-running automation is a good DLQ and a daily review of it. Everything else (idempotency, breakers, smart retries) is the engineering work to keep the DLQ small. The DLQ itself is the operational discipline.

Workflows aren't reliable. People with workflows can be.
