End-to-end observability for agent workflows: traces, metrics, logs, quality signals, and cost attribution.

Observability

Overview

Operating agents requires visibility into what happened and why. Observability spans model calls, retrieval, tool invocations, human approvals, and final outputs.

What is captured

  • Traces per run and per step (planning, retrieval, tool calls, verification).
  • Latency breakdown per step, including time spent waiting on dependencies.
  • Cost attribution (tokens, tool usage) by workflow and by tenant.
  • Error taxonomy: schema failures, policy denials, tool errors, timeouts.
  • Quality signals: citation coverage, groundedness, escalation rate.
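
As a rough illustration of the signals listed above, the following Python sketch shows one way a per-step trace record could be shaped and then rolled up into cost by tenant and workflow and into a per-step latency breakdown. Everything here is an assumption made for the example: the StepSpan fields, the ErrorKind names, both helper functions, and the prices do not describe any particular product's schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    # Hypothetical names mirroring the error taxonomy in the list above.
    SCHEMA_FAILURE = "schema_failure"
    POLICY_DENIAL = "policy_denial"
    TOOL_ERROR = "tool_error"
    TIMEOUT = "timeout"


@dataclass
class StepSpan:
    # One record per step of a run (hypothetical schema).
    run_id: str
    tenant: str
    workflow: str
    step: str  # e.g. "planning", "retrieval", "tool_call", "verification"
    latency_ms: float
    tokens: int = 0
    tool_calls: int = 0
    error: Optional[ErrorKind] = None


def cost_by_tenant_and_workflow(spans, usd_per_1k_tokens, usd_per_tool_call):
    """Sum token and tool cost for each (tenant, workflow) pair."""
    totals = defaultdict(float)
    for s in spans:
        cost = s.tokens / 1000 * usd_per_1k_tokens + s.tool_calls * usd_per_tool_call
        totals[(s.tenant, s.workflow)] += cost
    return dict(totals)


def latency_by_step(spans, run_id):
    """Total latency per step type for a single run."""
    breakdown = defaultdict(float)
    for s in spans:
        if s.run_id == run_id:
            breakdown[s.step] += s.latency_ms
    return dict(breakdown)


# Example data with made-up tenants, workflows, and prices.
spans = [
    StepSpan("r1", "acme", "support", "planning", 120.0, tokens=2000),
    StepSpan("r1", "acme", "support", "tool_call", 300.0, tool_calls=2,
             error=ErrorKind.TIMEOUT),
    StepSpan("r1", "acme", "support", "tool_call", 200.0),
    StepSpan("r2", "globex", "support", "retrieval", 80.0, tokens=1000),
]
costs = cost_by_tenant_and_workflow(spans, usd_per_1k_tokens=0.5, usd_per_tool_call=0.25)
r1_latency = latency_by_step(spans, "r1")
```

Keeping tenant and workflow on every span, rather than only on the run, is a design choice in this sketch: it lets cost and latency be grouped directly from step records without a join back to run metadata.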

Operational outcomes

  • Faster incident triage: pinpoint failing tools and steps.
  • Controlled spend: detect runaway runs and optimize model routing and context size.
  • Quality improvements: identify failure modes and drive eval coverage.
  • Compliance readiness: produce audit trails for actions and access.
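
Two of these outcomes, incident triage and spend control, can be sketched as simple queries over step-level events. The Python below is a minimal illustration under assumptions: events are plain dictionaries with hypothetical keys (run_id, step, tool, tokens, error), and the budget thresholds are placeholder values, not recommendations.

```python
from collections import Counter, defaultdict


def find_runaway_runs(events, max_tokens_per_run, max_steps_per_run):
    """Return run IDs that exceed a token budget or a step-count budget."""
    tokens = defaultdict(int)
    steps = Counter()
    for e in events:
        tokens[e["run_id"]] += e.get("tokens", 0)
        steps[e["run_id"]] += 1
    return sorted(
        run_id for run_id in steps
        if tokens[run_id] > max_tokens_per_run or steps[run_id] > max_steps_per_run
    )


def failures_by_step_and_tool(events):
    """Count errors per (step, tool, error kind), most frequent first."""
    counts = Counter()
    for e in events:
        if e.get("error"):
            counts[(e["step"], e.get("tool"), e["error"])] += 1
    return counts.most_common()


# Example events with made-up run IDs, tools, and error labels.
events = [
    {"run_id": "a", "step": "retrieval", "tokens": 500},
    {"run_id": "a", "step": "tool_call", "tool": "search", "tokens": 100, "error": "timeout"},
    {"run_id": "b", "step": "planning", "tokens": 9000},
    {"run_id": "c", "step": "planning", "tokens": 10},
    {"run_id": "c", "step": "tool_call", "tool": "search", "tokens": 10, "error": "timeout"},
    {"run_id": "c", "step": "tool_call", "tool": "crm", "tokens": 10, "error": "tool_error"},
    {"run_id": "c", "step": "verification", "tokens": 10},
]
runaway = find_runaway_runs(events, max_tokens_per_run=5000, max_steps_per_run=3)
top_failures = failures_by_step_and_tool(events)
```

In this example, run "b" is flagged for token usage and run "c" for step count, while the failure ranking puts the "search" tool's timeouts first, which is the kind of signal that points triage at a specific tool and step.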