Key topics
- Latency budgeting across retrieval, model calls, and tools.
- Caching strategies for retrieval and stable responses.
- Batching and streaming to reduce perceived latency.
- Context minimization and passage quality improvements.
- Cost controls: routing, token caps, and fallback behavior.
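Latency budgeting can be sketched as a small per-request helper that splits a total deadline across stages. This is an illustrative sketch, not a platform API: the class name, the stage fractions, and the 2000 ms total are all assumptions for the example.

```python
import time

class LatencyBudget:
    """Track a per-request latency budget shared across pipeline stages.

    Hypothetical helper for illustration; stage names and the budget
    split are not part of any specific platform.
    """

    def __init__(self, total_ms: float):
        self.total_ms = total_ms
        self.start = time.monotonic()

    def remaining_ms(self) -> float:
        # Budget left after time already spent on earlier stages.
        elapsed_ms = (time.monotonic() - self.start) * 1000
        return max(0.0, self.total_ms - elapsed_ms)

    def stage_deadline_ms(self, fraction: float) -> float:
        """Grant a stage at most `fraction` of whatever budget remains."""
        return self.remaining_ms() * fraction

# Example: a 2-second request budget where retrieval may use up to 25%
# of whatever remains when it starts.
budget = LatencyBudget(total_ms=2000)
retrieval_deadline = budget.stage_deadline_ms(0.25)
```

Deriving each stage's deadline from the *remaining* budget (rather than fixed slices) lets later stages absorb slack when earlier stages finish early.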
Common pitfalls
- Solving everything by increasing context size (token explosion).
- Retry storms during dependency outages.
- No caching for repeated queries and stable KB pages.
- Using expensive models for all tasks.
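Retry storms happen when every client retries immediately and in lockstep during an outage. One common mitigation is capped exponential backoff with full jitter; the sketch below is a generic illustration (function name and defaults are assumptions, not a library API).

```python
import random
import time

def call_with_backoff(fn, max_attempts=3, base_delay_s=0.1, max_delay_s=2.0):
    """Call `fn`, retrying on failure with capped exponential backoff.

    Full jitter (a random delay in [0, cap]) desynchronizes clients so a
    dependency recovering from an outage is not hit by a retry storm.
    Illustrative sketch; tune attempts and delays per dependency.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the error
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

In practice this belongs behind a shared client wrapper so every tool and retrieval call gets the same retry discipline, and it pairs with the circuit breakers discussed below.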
Recommended practices
- Set per-workflow budgets and enforce them at runtime.
- Improve retrieval precision to reduce context size.
- Apply circuit breakers and backpressure on tools.
- Adopt cheap-to-expensive model routing with eval validation.
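A circuit breaker for tool calls can be sketched in a few lines: open after N consecutive failures, then allow a probe after a cooldown. This is a minimal illustration under assumed thresholds, not a specific library's API.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker for a single tool dependency.

    Opens after `failure_threshold` consecutive failures; after
    `cooldown_s` it permits a probe call (half-open). Illustrative
    sketch: real deployments add per-endpoint state and metrics.
    """

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True  # half-open: let one probe through
        return False  # open: shed load instead of queueing retries

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Callers check `allow()` before invoking the tool and fall back (cached answer, degraded response) when it returns False, which turns a dependency outage into bounded degradation instead of a retry storm.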
This page is intended to be actionable for engineering teams. For platform-specific details, see /platform/agents, /platform/orchestration, and /platform/knowledge.