Evaluation engineering

Evaluation engineering for agents: datasets, scoring, golden sets, regression pipelines, and red teaming.

Overview

Evals are the quality control mechanism for agents. They help you detect regressions from prompt changes, model updates, retrieval drift, and tool behavior changes.

Key topics

Golden sets and expected outcomes (including acceptable variance).
Automated scoring: schema validity, grounding, and business-rule checks.
Human review workflows for subjective judgments.
Red teaming scenarios (prompt injection, data leakage, unsafe actions).

Common pitfalls

Measuring only ‘helpfulness’ without correctness or evidence checks.
No separation between retrieval failure and generation failure.
Infrequent eval runs that miss drift and regressions.
No tracking of cost/latency alongside quality.

Recommended practices

Make evaluation part of CI/CD and nightly regressions.
Track quality, latency, and cost together for trade-off decisions.
Use failure clustering to prioritize improvements.
Keep datasets versioned and representative of production traffic.

This page is intended to be actionable for engineering teams. For platform-specific details, cross-reference /platform/agents, /platform/orchestration, and /platform/knowledge.

Evaluation engineering

Overview

Navigate

Key topics

Common pitfalls

Recommended practices