Evaluation engineering for agents: datasets, scoring, golden sets, regression pipelines, and red teaming.

Evaluation engineering

Overview

Evals are the quality control mechanism for agents. They help you detect regressions from prompt changes, model updates, retrieval drift, and tool behavior changes.

Key topics

  • Golden sets and expected outcomes (including acceptable variance).
  • Automated scoring: schema validity, grounding, and business-rule checks.
  • Human review workflows for subjective judgments.
  • Red teaming scenarios (prompt injection, data leakage, unsafe actions).

Common pitfalls

  • Measuring only ‘helpfulness’ without correctness or evidence checks.
  • No separation between retrieval failure and generation failure.
  • Infrequent eval runs that miss drift and regressions.
  • No tracking of cost/latency alongside quality.

Recommended practices

  • Make evaluation part of CI/CD and nightly regressions.
  • Track quality, latency, and cost together for trade-off decisions.
  • Use failure clustering to prioritize improvements.
  • Keep datasets versioned and representative of production traffic.

This page is intended to be actionable for engineering teams. For platform-specific details, cross-reference /platform/agents, /platform/orchestration, and /platform/knowledge.