Performance engineering for agent systems: latency, throughput, caching, batching, context minimization, and cost controls.

Performance engineering

Overview

Performance is a product feature for agents. Users abandon slow systems, and costs spiral when context size and retries go unchecked. Performance engineering trades off answer quality against predictable latency and cost budgets.

Key topics

  • Latency budgeting across retrieval, model calls, and tools.
  • Caching strategies for retrieval and stable responses.
  • Batching and streaming to reduce perceived latency.
  • Context minimization and passage quality improvements.
  • Cost controls: routing, token caps, and fallback behavior.
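As one illustration of latency budgeting, the sketch below tracks a single overall deadline for a request and gives each stage the smaller of its own cap and whatever time is left. The class name LatencyBudget, the stage names, and the millisecond caps are assumptions made for this example, not values prescribed by this page.

```python
import time
from typing import Callable


class LatencyBudget:
    """Tracks the time remaining in one request's overall latency budget."""

    def __init__(self, total_ms: float, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._deadline = clock() + total_ms / 1000.0

    def remaining_ms(self) -> float:
        """Milliseconds left before the overall deadline (never negative)."""
        return max(0.0, (self._deadline - self._clock()) * 1000.0)

    def stage_timeout_ms(self, stage_cap_ms: float) -> float:
        """Timeout for a stage: its own cap or the time left, whichever is smaller."""
        return min(stage_cap_ms, self.remaining_ms())


# Hypothetical per-stage caps in milliseconds; tune these per workflow.
STAGE_CAPS_MS = {"retrieval": 300, "model": 2000, "tools": 700}

if __name__ == "__main__":
    budget = LatencyBudget(total_ms=2500)
    for stage, cap in STAGE_CAPS_MS.items():
        # Pass this value as the timeout of the real retrieval/model/tool call.
        print(stage, round(budget.stage_timeout_ms(cap)))
```

Deriving each stage timeout from the shared deadline means a slow retrieval step shrinks the time available to later stages instead of silently extending the whole request.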

Common pitfalls

  • Treating a larger context window as the fix for every quality problem (token explosion).
  • Retry storms during dependency outages.
  • Not caching repeated queries or stable knowledge-base (KB) pages.
  • Using expensive models for all tasks.
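The caching pitfall above can often be addressed with a small time-to-live (TTL) cache in front of retrieval. The following is a minimal in-memory sketch, assuming a single process and no eviction policy beyond expiry. TTLCache and normalize_query are hypothetical names, and a production system would more likely use a shared cache service.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps key -> (expiry time, cached value).
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: skip the expensive call
        value = compute()  # miss or expired: recompute and store
        self._entries[key] = (now + self._ttl, value)
        return value


def normalize_query(query: str) -> str:
    """Hypothetical normalization so trivially different queries share an entry."""
    return " ".join(query.lower().split())


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=300)
    key = normalize_query("  How do I reset my password? ")
    # The lambda stands in for a real retrieval call.
    print(cache.get_or_compute(key, lambda: "retrieved passages"))
```

Choose the TTL per content type: stable KB pages tolerate long expiry times, while fast-changing sources need short ones or explicit invalidation.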

Recommended practices

  • Set per-workflow budgets and enforce them at runtime.
  • Use retrieval precision improvements to reduce context size.
  • Apply circuit breakers and backpressure on tools.
  • Route requests to a cheap model first and escalate to an expensive one only when needed, validating the routing policy against evals.
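To make the circuit-breaker practice concrete, here is a minimal single-threaded sketch. After a configurable number of consecutive failures it rejects calls immediately for a cool-down period, which keeps retries from piling onto a failing dependency. The class names, threshold, and cool-down values are assumptions for illustration, and the sketch is not thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0  # consecutive failures seen so far
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("dependency unavailable; failing fast")
            # Cool-down elapsed: let a trial call through (half-open state).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each tool invocation, for example breaker.call(lambda: tool(args)), and treat CircuitOpenError as the signal to use fallback behavior rather than retry.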

This page is intended to be actionable for engineering teams. For platform-specific details, cross-reference /platform/agents, /platform/orchestration, and /platform/knowledge.