What is captured
- Traces per run and per step (planning, retrieval, tool calls, verification).
- Latency breakdown per step, with timing for each downstream dependency.
- Cost attribution (tokens, tool usage) by workflow and by tenant.
- Error taxonomy: schema failures, policy denials, tool errors, timeouts.
- Quality signals: citation coverage, groundedness, escalation rate.
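
A minimal sketch of how the per-step records above could be modeled and aggregated. The `StepTrace` type, its field names, the `ERROR_KINDS` labels, and the sample data are all assumptions made for illustration, not a documented schema; quality signals are left out to keep the example short.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels mirroring the error taxonomy listed above.
ERROR_KINDS = {"schema_failure", "policy_denial", "tool_error", "timeout"}


@dataclass(frozen=True)
class StepTrace:
    """One step of one run (illustrative shape, not a real schema)."""
    run_id: str
    tenant: str
    workflow: str
    step: str                    # "planning", "retrieval", "tool_call", "verification"
    latency_ms: float
    tokens: int
    tool_cost_usd: float = 0.0
    error: Optional[str] = None  # one of ERROR_KINDS, or None on success

    def __post_init__(self):
        if self.error is not None and self.error not in ERROR_KINDS:
            raise ValueError(f"unknown error kind: {self.error}")


def tokens_by(traces, field):
    """Sum token usage grouped by a StepTrace field such as 'tenant' or 'workflow'."""
    totals = defaultdict(int)
    for t in traces:
        totals[getattr(t, field)] += t.tokens
    return dict(totals)


def latency_by_step(traces, run_id):
    """Total latency per step kind for a single run."""
    totals = defaultdict(float)
    for t in traces:
        if t.run_id == run_id:
            totals[t.step] += t.latency_ms
    return dict(totals)


def error_counts(traces):
    """Count failed steps per error kind."""
    return Counter(t.error for t in traces if t.error is not None)


# Made-up sample data.
traces = [
    StepTrace("run-1", "acme", "support", "planning", 120.0, 300),
    StepTrace("run-1", "acme", "support", "retrieval", 80.0, 0),
    StepTrace("run-1", "acme", "support", "tool_call", 950.0, 150, 0.002, "timeout"),
    StepTrace("run-2", "globex", "billing", "planning", 100.0, 250),
    StepTrace("run-2", "globex", "billing", "verification", 60.0, 90, error="schema_failure"),
]

print(tokens_by(traces, "tenant"))
print(latency_by_step(traces, "run-1"))
print(error_counts(traces))
```

Keeping tenant and workflow on every step record, rather than only on the run, lets the same list be grouped either way without a join.
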
Operational outcomes
- Faster incident triage: pinpoint failing tools and steps.
- Controlled spend: detect runaway runs and optimize routing and context size.
- Quality improvements: identify failure modes and drive eval coverage.
- Compliance readiness: produce audit trails for actions and access.
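
Two of these outcomes, spotting runaway runs and pinpointing failing tools, can be sketched as simple queries over step records. The tuple layout, the tool names, and the `TOKEN_BUDGET` value are hypothetical; a real deployment would read these from its own trace store and set its own thresholds.

```python
from collections import defaultdict

# Hypothetical step records: (run_id, tool name or None, tokens, failed)
steps = [
    ("run-1", "search", 400, False),
    ("run-1", "sql", 300, True),
    ("run-2", "search", 9000, False),
    ("run-2", "search", 8500, False),
    ("run-3", "sql", 200, True),
    ("run-3", None, 150, False),  # a step with no tool call
]

TOKEN_BUDGET = 10_000  # assumed per-run budget, for illustration only


def runaway_runs(steps, budget):
    """Return the ids of runs whose total token usage exceeds the budget."""
    totals = defaultdict(int)
    for run_id, _tool, tokens, _failed in steps:
        totals[run_id] += tokens
    return sorted(run_id for run_id, total in totals.items() if total > budget)


def tool_failure_rates(steps):
    """Return (tool, failure rate) pairs, worst tool first."""
    calls = defaultdict(int)
    failures = defaultdict(int)
    for _run_id, tool, _tokens, failed in steps:
        if tool is None:
            continue  # not a tool call
        calls[tool] += 1
        failures[tool] += int(failed)
    rates = {tool: failures[tool] / calls[tool] for tool in calls}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


print(runaway_runs(steps, TOKEN_BUDGET))
print(tool_failure_rates(steps))
```

With this sample data, only `run-2` exceeds the assumed budget, and `sql` ranks first because both of its calls failed.
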