Key topics
- State machines and checkpointing for long-running workflows.
- Retry policies with error taxonomy and budgets.
- Idempotency and de-duplication for side effects.
- Fallback strategies and graceful degradation.
- Operational readiness: runbooks, alerts, and incident processes.
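To make the first topic concrete, here is a minimal sketch of a checkpointed workflow, assuming a linear sequence of steps and a local JSON file as the checkpoint store. The step names, `run_workflow`, `load_checkpoint`, and `save_checkpoint` are hypothetical names chosen for illustration, not part of any platform API.

```python
import json
import os
import tempfile

# Hypothetical step order, for illustration only.
STEPS = ["fetch", "transform", "publish"]


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "results": {}}


def save_checkpoint(path, state):
    """Write to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic replacement of the previous checkpoint


def run_workflow(path, handlers):
    """Run the remaining steps, checkpointing after each completed step."""
    state = load_checkpoint(path)
    while state["next_step"] < len(STEPS):
        name = STEPS[state["next_step"]]
        # Each handler receives the results of the steps completed so far.
        state["results"][name] = handlers[name](state["results"])
        state["next_step"] += 1
        save_checkpoint(path, state)
    return state
```

If a step raises, the checkpoint still records every step that finished, so calling `run_workflow` again with the same path resumes at the failed step instead of starting over. A production system would likely use a database rather than a local file, and results would need to be JSON-serializable under this sketch.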
Common pitfalls
- Retries without idempotency causing duplicate writes.
- No persisted state, so a failure loses progress and context.
- Assuming tools are always available and consistent.
- No alerting on quality regressions or cost anomalies.
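The first pitfall above can be illustrated with a small sketch of de-duplication by idempotency key. It assumes the caller supplies a stable key per logical write and uses an in-memory dict as a stand-in for a durable receipt store. `IdempotentWriter` and `apply_write` are hypothetical names for illustration.

```python
class IdempotentWriter:
    """Hypothetical wrapper that applies each logical write at most once."""

    def __init__(self, apply_write):
        self._apply_write = apply_write  # the real side effect
        self._receipts = {}              # stand-in for a durable store

    def write(self, idempotency_key, payload):
        # A retry with a known key returns the stored receipt
        # instead of performing the write again.
        if idempotency_key in self._receipts:
            return self._receipts[idempotency_key]
        receipt = self._apply_write(payload)
        self._receipts[idempotency_key] = receipt
        return receipt
```

A retry that reuses the key from the original attempt gets the original receipt back, so the side effect happens once. In a real deployment the check and the write would need to be atomic and the receipts durable; this sketch ignores concurrency and crashes between the write and the receipt being stored.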
Recommended practices
- Persist state and traces for replay and audit.
- Design write actions to be idempotent, and record evidence that each write committed.
- Implement backoff, jitter, and circuit breakers.
- Monitor clusters of related failures and fix their root causes systematically.
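As one way to combine backoff, jitter, and a circuit breaker, here is a minimal single-threaded sketch. It assumes retryable failures are signalled with a `TransientError` exception and uses "full jitter" (a uniform random delay up to an exponentially growing cap). All class names, function names, and default values are hypothetical choices for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


class CircuitOpenError(Exception):
    """Raised when the breaker is refusing calls."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self):
        """Closed: allow. Open: allow a probe once reset_after has elapsed."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a probe


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: a random delay scaled by min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, breaker, max_attempts=4, sleep=time.sleep):
    """Retry fn on TransientError; other exceptions propagate immediately."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; skipping call")
        try:
            result = fn()
        except TransientError:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return result
```

Here `max_attempts` acts as a simple per-call retry budget, and the breaker stops further calls once consecutive failures reach the threshold. Injecting `sleep`, `clock`, and `rng` keeps the sketch testable without real delays. Retrying write actions this way is only safe when those writes are idempotent.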
This page is intended to be actionable for engineering teams. For platform-specific details, cross-reference /platform/agents, /platform/orchestration, and /platform/knowledge.