Key topics
- Golden sets and expected outcomes (including acceptable variance).
- Automated scoring: schema validity, grounding, and business-rule checks.
- Human review workflows for subjective judgments.
- Red teaming scenarios (prompt injection, data leakage, unsafe actions).
Common pitfalls
- Measuring only ‘helpfulness’ without correctness or evidence checks.
- No separation between retrieval failure and generation failure.
- Infrequent eval runs that miss drift and regressions.
- No tracking of cost/latency alongside quality.
Recommended practices
- Make evaluation part of CI/CD and nightly regressions.
- Track quality, latency, and cost together for trade-off decisions.
- Use failure clustering to prioritize improvements.
- Keep datasets versioned and representative of production traffic.
This page is intended to be actionable for engineering teams. For platform-specific details, cross-reference /platform/agents, /platform/orchestration, and /platform/knowledge.