Evaluation building blocks
- Golden sets: curated queries with expected outcomes and acceptable variance.
- Task metrics: correctness, completeness, structured-output validity.
- Grounding metrics: citation coverage, evidence alignment, refusal correctness.
- Operational metrics: latency, cost, tool error rate, escalation rate.
Where evaluation is applied
- Pre-deployment: validate new workflows and integrations.
- CI/CD: prevent regressions after prompt/model/tool changes.
- Runtime: detect drift (data freshness, tool behavior, user patterns).
- Security testing: prompt injection scenarios and data exfiltration attempts.
Practical approach
- Start with high-impact workflows and define ‘must not fail’ test cases.
- Add structured checks (schemas, business rules) before subjective scoring.
- Automate nightly eval runs; alert on regressions and cost spikes.
- Use failures as feedback to improve retrieval, policies, and tooling.