Key topics
- Latency budgeting across retrieval, model calls, and tools.
- Caching strategies for retrieval and stable responses.
- Batching and streaming to reduce perceived latency.
- Context minimization and passage quality improvements.
- Cost controls: routing, token caps, and fallback behavior.
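Latency budgeting can be sketched as a small per-request helper that splits a total deadline across stages. This is an illustrative sketch, not a platform API: the class name, the stage fractions, and the 2000 ms total are all assumptions for the example.

```python
import time

class LatencyBudget:
    """Track a per-request latency budget shared across pipeline stages.

    Hypothetical helper for illustration; stage names and the budget
    split are not part of any specific platform.
    """

    def __init__(self, total_ms: float):
        self.total_ms = total_ms
        self.start = time.monotonic()

    def remaining_ms(self) -> float:
        # Budget left after time already spent on earlier stages.
        elapsed_ms = (time.monotonic() - self.start) * 1000
        return max(0.0, self.total_ms - elapsed_ms)

    def stage_deadline_ms(self, fraction: float) -> float:
        """Grant a stage at most `fraction` of whatever budget remains."""
        return self.remaining_ms() * fraction

# Example: a 2-second request budget where retrieval may use up to 25%
# of whatever remains when it starts.
budget = LatencyBudget(total_ms=2000)
retrieval_deadline = budget.stage_deadline_ms(0.25)
```

Deriving each stage's deadline from the *remaining* budget (rather than fixed slices) lets later stages absorb slack when earlier stages finish early.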
Common pitfalls
- Solving everything by increasing context size (token explosion).
- Retry storms during dependency outages.
- No caching for repeated queries and stable KB pages.
- Using expensive models for all tasks.
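Retry storms happen when every client retries immediately and in lockstep during an outage. One common mitigation is capped exponential backoff with full jitter; the sketch below is a generic illustration (function name and defaults are assumptions, not a library API).

```python
import random
import time

def call_with_backoff(fn, max_attempts=3, base_delay_s=0.1, max_delay_s=2.0):
    """Call `fn`, retrying on failure with capped exponential backoff.

    Full jitter (a random delay in [0, cap]) desynchronizes clients so a
    dependency recovering from an outage is not hit by a retry storm.
    Illustrative sketch; tune attempts and delays per dependency.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; surface the error
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

In practice this belongs behind a shared client wrapper so every tool and retrieval call gets the same retry discipline, and it pairs with the circuit breakers discussed below.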
Recommended practices
- Set per-workflow budgets and enforce them at runtime.
- Improve retrieval precision to reduce context size.
- Apply circuit breakers and backpressure on tools.
- Adopt cheap-to-expensive model routing with eval validation.
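A circuit breaker for tool calls can be sketched in a few lines: open after N consecutive failures, then allow a probe after a cooldown. This is a minimal illustration under assumed thresholds, not a specific library's API.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker for a single tool dependency.

    Opens after `failure_threshold` consecutive failures; after
    `cooldown_s` it permits a probe call (half-open). Illustrative
    sketch: real deployments add per-endpoint state and metrics.
    """

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True  # half-open: let one probe through
        return False  # open: shed load instead of queueing retries

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Callers check `allow()` before invoking the tool and fall back (cached answer, degraded response) when it returns False, which turns a dependency outage into bounded degradation instead of a retry storm.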
This page is intended to be actionable for engineering teams. For platform-specific details, see /platform/agents, /platform/orchestration, and /platform/knowledge.