Reliability engineering for agent workflows: retries, idempotency, state machines, replay, fallbacks, and incident readiness.

Reliability engineering

Overview

Reliability is achieved through explicit state, predictable retries, and controlled side effects. Agent systems must tolerate partial failures and resume safely without duplication or data corruption.
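
As a rough illustration of explicit state and safe resumption, here is a minimal sketch of a checkpointed state machine. The step names (plan, act, verify), the JSON checkpoint file, and the handlers mapping are all assumptions made for this example and are not part of any platform API.

```python
import json
import os
import tempfile
from enum import Enum


class Step(str, Enum):
    PLAN = "plan"
    ACT = "act"
    VERIFY = "verify"
    DONE = "done"


# Allowed transitions for this hypothetical workflow.
NEXT = {Step.PLAN: Step.ACT, Step.ACT: Step.VERIFY, Step.VERIFY: Step.DONE}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the persisted state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": Step.PLAN.value, "completed": []}
    with open(path) as f:
        return json.load(f)


def run(path, handlers):
    """Resume from the last checkpoint and run until DONE."""
    state = load_checkpoint(path)
    while Step(state["step"]) is not Step.DONE:
        step = Step(state["step"])
        handlers[step](state)          # may raise; the checkpoint stays at `step`
        state["completed"].append(step.value)
        state["step"] = NEXT[step].value
        save_checkpoint(path, state)   # persist only after the step succeeds
    return state
```

If a handler raises, calling run again with the same path picks up at the failed step instead of starting over. A crash between a handler's side effect and the checkpoint write would still rerun that step, which is why the handlers themselves need to be idempotent.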

Key topics

  • State machines and checkpointing for long-running workflows.
  • Retry policies with error taxonomy and budgets.
  • Idempotency and de-duplication for side effects.
  • Fallback strategies and graceful degradation.
  • Operational readiness: runbooks, alerts, and incident processes.
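
To make the retry and fallback topics concrete, the sketch below combines a two-class error taxonomy, a shared retry budget, and a degraded fallback. The class names (TransientError, PermanentError, RetryBudget) and the call_with_policy helper are hypothetical, and the delay between attempts is omitted to keep the example short.

```python
class TransientError(Exception):
    """Worth retrying: timeouts, rate limits, temporary unavailability."""


class PermanentError(Exception):
    """Not worth retrying: invalid input, denied permission."""


class RetryBudget:
    """Caps total retries across a whole workflow run, not per call."""

    def __init__(self, max_retries):
        self.remaining = max_retries

    def try_spend(self):
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def call_with_policy(primary, fallback, budget, max_attempts=3):
    """Retry transient errors within budget, then degrade to the fallback."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"value": primary(), "degraded": False}
        except PermanentError:
            break  # retrying cannot help; go straight to the fallback
        except TransientError:
            if attempt == max_attempts or not budget.try_spend():
                break
    return {"value": fallback(), "degraded": True}
```

Returning a "degraded" flag alongside the value lets callers and dashboards distinguish a fresh result from a fallback, rather than hiding the failure.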

Common pitfalls

  • Retries without idempotency causing duplicate writes.
  • No persistence: failures lose progress and context.
  • Assuming tools are always available and always return consistent results.
  • No alerting on quality regressions or cost anomalies.
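
The first pitfall above, duplicate writes caused by retries, is commonly addressed with an idempotency key. The following is a minimal sketch: IdempotentWriter and its in-memory dictionary are illustrative assumptions, and a real deployment would need a durable store with an atomic check-and-set.

```python
import hashlib
import json


class IdempotentWriter:
    """Wraps a side-effecting write so repeated calls with the same key run it once."""

    def __init__(self, write_fn):
        self._write_fn = write_fn
        self._results = {}  # in-memory stand-in for a durable store

    @staticmethod
    def key_for(workflow_id, step, payload):
        """Derive a stable key from what the write means, not when it ran."""
        body = json.dumps(payload, sort_keys=True)
        raw = f"{workflow_id}:{step}:{body}".encode()
        return hashlib.sha256(raw).hexdigest()

    def write(self, workflow_id, step, payload):
        key = self.key_for(workflow_id, step, payload)
        if key in self._results:
            return self._results[key]      # duplicate: return the recorded result
        result = self._write_fn(payload)   # first time: perform the side effect
        self._results[key] = result
        return result
```

Because the key is derived from the workflow, step, and payload, a retried step produces the same key and gets the recorded result back instead of writing again.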

Recommended practices

  • Persist state and traces for replay and audit.
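  • Design write actions to be idempotent, and record evidence of each commit (for example, a receipt or returned ID) so a retry can confirm whether the write already happened.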
  • Implement backoff, jitter, and circuit breakers.
  • Monitor for clusters of related failures and fix the shared cause rather than patching incidents one at a time.
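
Backoff, jitter, and circuit breakers can be sketched in a few lines. The version below uses "full jitter" (a uniform random delay up to an exponentially growing cap) and a simple single-threaded breaker. The names, thresholds, and defaults are assumptions chosen for illustration, not recommended values.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0, min(cap, base * (2 ** attempt)))


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: half-open, so let a trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically sleep for backoff_delay(attempt) between retries and treat CircuitOpenError as a signal to use a fallback immediately rather than retry.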

This page is intended to be actionable for engineering teams. For platform-specific details, see /platform/agents, /platform/orchestration, and /platform/knowledge.