The Usual Suspects: A Canonical Agent Architecture
You can’t evaluate what you can’t decompose. Before diving into what makes agentic eval hard, it helps to have a shared picture of what we’re evaluating. This article outlines a canonical architecture for agentic systems — not as a prescriptive blueprint, but as a map of the components where things go wrong and evaluation needs to happen.
The Building Blocks
Your system may not include every component below, or may combine them differently (see Azure’s overview of agent orchestration patterns for a complementary taxonomy), but the diagram captures the key building blocks that recur across most production agentic systems I’ve seen.
flowchart TB
Human["Human Inputs<br/>(requests, clarifications,<br/>approvals, escalations)"]
Retrieval["Retrieval Service<br/>(documents, knowledge bases,<br/>structured data)"]
Context["Context Manager<br/>(prompt assembly, history<br/>compression, context scoping)"]
Auth["Authentication &<br/>Authorization<br/>(permissions, data access,<br/>action controls)"]
Human -->|"request / feedback"| Context
Retrieval -->|"retrieved<br/>context"| Context
Planner["Planner<br/>(task decomposition,<br/>step sequencing)"]
Executor["Executor<br/>(plan execution,<br/>step-by-step progression)"]
Tools["Tool Invocation<br/>(APIs, databases, code<br/>execution, MCP)"]
SubAgent["Sub-Agent Invocation<br/>(delegated tasks with<br/>own context & planning)"]
Dispatcher["Dispatcher<br/>(invokes model, returns<br/>response or dispatches)"]
Router["Model Router"]
Context -->|"assembled<br/>context"| Planner
Context -->|"assembled<br/>context"| Executor
Planner -->|"planning<br/>request"| Dispatcher
Executor -->|"execution<br/>request"| Dispatcher
Dispatcher -->|"model<br/>selection"| Router
Router -->|"model<br/>response"| Dispatcher
Dispatcher -->|"generated<br/>plan"| Context
Dispatcher -->|"tool calls"| Tools
Dispatcher -->|"delegated<br/>tasks"| SubAgent
Dispatcher -->|"execution<br/>outcome"| Context
Planner -->|"plan"| Executor
Executor -->|"execution<br/>outcomes"| Planner
Tools -->|"tool results"| Context
SubAgent -->|"sub-agent<br/>results"| Context
Planner -->|"responses &<br/>escalations"| Human
Auth -.->|"gates"| Tools
Auth -.->|"gates"| SubAgent
Auth -.->|"gates"| Retrieval
Retrieval Service. The RAG component that fetches organizational context — documents, knowledge bases, structured data — relevant to the current task. This is distinct from the explicit context the user or system provides to the agent; retrieval augments that context with information the agent wouldn’t otherwise have access to. The quality of what gets retrieved (and what doesn’t) has an outsized effect on everything downstream.
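As a deliberately naive illustration of the retrieval interface (query in, ranked context out), here is a minimal sketch in Python. The `Doc` type and keyword-overlap scoring are illustrative stand-ins; a real retrieval service would use a search index or embedding store:

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    text: str


def retrieve(query: str, corpus: list[Doc], k: int = 3) -> list[Doc]:
    """Rank documents by naive term overlap with the query and return the top-k.

    Only documents with nonzero overlap are returned: retrieving nothing is
    better than poisoning downstream context with irrelevant material.
    """
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(d.text.lower().split())), d) for d in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]
```

The interface is the point, not the scorer: whatever ranks the corpus, the agent only ever sees the top-k slice, which is why retrieval quality dominates everything downstream.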
Context Manager. The layer responsible for assembling, structuring, and maintaining the agent’s working context throughout execution. This includes deciding what goes into the prompt, how execution and conversation history is compressed or summarized as the context window fills up, how retrieved content is ranked and sub-selected, and how context is scoped when delegating to sub-agents. Context management is arguably the most underappreciated component in agentic systems — and one of the most common sources of subtle failures.
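To make the core decision concrete (what stays in the window as it fills up), here is a toy assembly function. It budgets in characters and evicts rather than summarizes, both simplifications of what a production context manager does:

```python
def assemble_context(system: str, history: list[str], retrieved: list[str],
                     budget: int = 200) -> str:
    """Assemble a prompt, dropping the oldest history turns when over budget.

    Real systems budget in tokens and summarize evicted turns instead of
    discarding them; the decision point is the same either way.
    """
    def size(parts: list[str]) -> int:
        return sum(len(p) for p in parts)

    kept = list(history)
    while kept and size([system, *kept, *retrieved]) > budget:
        kept.pop(0)  # evict the oldest turn first
    return "\n".join([system, *retrieved, *kept])
```

Note the failure mode this makes visible: a turn evicted here is a fact the agent has silently “forgotten,” which is exactly the class of subtle failure the prose above warns about.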
Dispatcher. The central component that receives requests from both the planner and executor, delegates model invocation to the model router, and inspects the response to decide what happens next. If the model returns tool calls or sub-agent delegations, the dispatcher routes them accordingly. If the planner requested a plan, the dispatcher sends the generated plan back to the context manager for assembly into the next iteration’s context. Execution outcomes (the results of model calls, tool invocations, and sub-agent delegations) flow back through the context manager the same way. The dispatcher also handles response parsing, retry logic, and error recovery.
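A minimal sketch of that inspect-and-route logic, with `model` and `tools` as hypothetical callables standing in for the router and tool layer:

```python
def dispatch(request: dict, model, tools: dict) -> dict:
    """Invoke the model, then route its response.

    Tool calls are executed and their outcomes collected; anything else is
    treated as generated content (e.g. a plan) and handed back for context
    assembly.
    """
    response = model(request)
    if response.get("tool_calls"):
        outcomes = []
        for call in response["tool_calls"]:
            try:
                result = tools[call["name"]](**call["args"])
                outcomes.append({"tool": call["name"], "result": result})
            except Exception as exc:  # recover: record the error, don't crash the loop
                outcomes.append({"tool": call["name"], "error": str(exc)})
        return {"kind": "execution_outcome", "outcomes": outcomes}
    return {"kind": "content", "content": response["content"]}
```

Real dispatchers add retries, schema validation of tool arguments, and sub-agent routing, but the branch structure (tool calls vs. content) is the essential shape.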
Model Router. The decision layer that sits behind the dispatcher and selects which model to invoke for a given request (RouteLLM is a notable example). When the dispatcher needs to invoke a model, it delegates model selection to the router, which chooses from the available models based on task complexity, model-specific strengths (e.g., reasoning vs. code generation), latency requirements, cost thresholds, and even consensus mechanisms where multiple models are queried and their outputs compared. The model’s response flows back through the router to the dispatcher. The router’s decisions directly affect both cost and quality.
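A toy version of the cost/capability trade-off at the heart of routing. The tier scheme and model table are invented for illustration; real routers (RouteLLM included) learn or tune these decisions rather than hard-coding them:

```python
def route(request: dict, models: list[dict]) -> str:
    """Pick the cheapest model whose capability tier covers the request.

    Each entry in `models` carries a capability tier and a relative cost;
    latency, model-specific strengths, and consensus strategies are omitted.
    """
    tier_needed = request.get("tier", 1)  # e.g. 1 = simple, 2 = heavy reasoning
    eligible = [m for m in models if m["tier"] >= tier_needed]
    if not eligible:
        raise ValueError("no available model meets the required capability tier")
    return min(eligible, key=lambda m: m["cost"])["name"]
```

Even this stub shows why routing errors are double-edged: under-routing sends hard tasks to weak models (cheap but wrong), over-routing burns budget on easy ones.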
Tool Invocation. The mechanism by which the agent calls external tools — APIs, databases, file systems, code execution environments, and anything else exposed via protocols like MCP. Tool invocation involves parameter construction, response handling, error recovery, and deciding when to retry versus when to abandon a tool call and replan.
Sub-Agent Invocation. Where the orchestrating agent delegates a portion of the task to another agent. This is tool invocation’s more complex cousin: the sub-agent has its own context, its own planning, its own tool access, and its own failure modes. The orchestrating agent must pass the right context downstream and correctly interpret what comes back — including partial results, errors, and escalations.
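Context scoping and result interpretation are the two tricky halves of delegation. A sketch, with the allow-list keys (`task_facts`, `constraints`) and the `status`/`output` result shape invented for illustration:

```python
def delegate(task: str, full_context: dict, sub_agent) -> dict:
    """Hand a task to a sub-agent with a scoped slice of context.

    Passing everything leaks irrelevant or sensitive context; passing too
    little starves the sub-agent. The allow-list below is the scoping policy.
    """
    allowed = {"task_facts", "constraints"}  # hypothetical scoping policy
    scoped = {k: v for k, v in full_context.items() if k in allowed}
    result = sub_agent(task, scoped)
    # Interpret what comes back: errors become escalations, not silent drops.
    if result.get("status") == "error":
        return {"ok": False, "escalate": True, "detail": result.get("detail")}
    return {"ok": True, "output": result.get("output")}
```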
Planner. The reasoning component that decomposes a user’s request into a sequence of steps, decides which tools and sub-agents to involve, and adapts the plan as execution unfolds. Some agents have an explicit planning step; others plan implicitly through chain-of-thought reasoning. Either way, the quality of the plan largely determines the quality of the outcome.
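For agents with an explicit planning step, a plan is at minimum an ordered sequence of steps with a progression pointer and a hook for revision. A sketch of that data structure, with illustrative field names:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str   # e.g. "call_tool" or "delegate"
    target: str   # tool or sub-agent name
    done: bool = False


@dataclass
class Plan:
    steps: list = field(default_factory=list)

    def next_step(self):
        """Return the first unfinished step, or None when the plan is complete."""
        return next((s for s in self.steps if not s.done), None)

    def replace_step(self, failed, alternative):
        # A real planner re-invokes the model to replan; this only marks
        # where the substitution happens in the structure.
        self.steps[self.steps.index(failed)] = alternative
```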
Human Inputs. The points at which a human interacts with the agent — not just the initial request, but mid-task clarifications, course corrections, approvals, and escalation responses. In multi-turn settings, human inputs can redirect the agent’s plan at any point, and the agent must handle this gracefully.
Authentication & Authorization. The controls that govern what the agent is allowed to do — which tools it can invoke, what data it can access, what actions it can take on behalf of the user. This is especially critical when agents handle sensitive data or perform actions with real-world consequences. Auth failures aren’t just security issues; they’re also eval-relevant, because an agent that lacks the right permissions will fail in ways that look like capability failures if you’re not measuring auth separately.
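That last point (auth failures masquerading as capability failures) is worth making mechanical. A sketch of a permission gate in front of tool invocation, with a hypothetical per-agent allow-list:

```python
def gated_invoke(agent_id: str, tool_name: str,
                 permissions: dict, tools: dict, **args):
    """Check the agent's permissions before invoking a tool.

    Raising a distinct PermissionError keeps auth failures separable from
    capability failures in logs and evals.
    """
    granted = permissions.get(agent_id, set())
    if tool_name not in granted:
        raise PermissionError(f"{agent_id} may not call {tool_name}")
    return tools[tool_name](**args)
```

If this gate instead returned a generic error, an eval would score the agent as unable to use the tool; the distinct error type is what lets you measure auth separately.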
Why This Matters for Evaluation
Each of these components introduces its own failure modes and quality dimensions. A retrieval service that returns irrelevant context poisons everything downstream. A context manager that drops critical history mid-session causes the agent to “forget.” A router that sends complex reasoning tasks to a lightweight model gets cheap but wrong answers. A planner that can’t adapt when tools fail produces brittle agents that work in demos and break in production.
The articles that follow in this series map specific evaluation concerns to the components where they arise. Agentic systems introduce at least eight dimensions of complexity that make evaluation fundamentally different from evaluating models in isolation — a future article will explore each in greater detail:
- Context complexity — shifting, compressed, and degrading context as sessions grow.
- Tool & sub-agent calls — parameter construction, error handling, and information loss at every handoff.
- Planning complexity — decomposition quality and dynamic replanning when conditions change.
- Multi-turn conversations — mid-session pivots, contradictions, and context window pressure.
- Multi-agent coordination — delegation chains, context scoping, and aggregation across sub-agents.
- HITL interactions — escalation timing, framing, and bridging automated execution with human judgment.
- Ambient & non-conversational agents — fully autonomous error recovery with no human safety net.
- Privacy & sensitivity — data flow controls that constrain both execution and evaluation.
The Two Core Loops
The full architecture diagram above can be decomposed into two distinct loops that drive agent behavior: the planning loop and the execution loop. Understanding these loops separately makes it easier to reason about where failures occur and what each component contributes.
The Planning Loop
The planning loop is the agent’s reasoning cycle. A user request enters through the context manager, which assembles the relevant context and passes it to the planner. The planner formulates a planning request, which the dispatcher sends to a model (via the router). The model’s generated plan flows back through the dispatcher to the context manager, where it is assembled into the context for the next iteration. The planner can then respond to the human or iterate — refining the plan through additional model calls before handing it off for execution.
flowchart TB
Human["Human Inputs<br/>(requests, clarifications,<br/>approvals, escalations)"]
Context["Context Manager<br/>(prompt assembly, history<br/>compression, context scoping)"]
Planner["Planner<br/>(task decomposition,<br/>step sequencing)"]
Dispatcher["Dispatcher<br/>(invokes model, returns<br/>response or dispatches)"]
Human -->|"request / feedback"| Context
Context -->|"assembled<br/>context"| Planner
Planner -->|"planning<br/>request"| Dispatcher
Dispatcher -->|"generated<br/>plan"| Context
Planner -->|"responses &<br/>escalations"| Human
classDef active fill:#dbeafe,stroke:#2563eb,color:#1e3a5f,stroke-width:2px
class Human,Context,Planner,Dispatcher active
linkStyle 0,1,2,3,4 stroke:#2563eb,stroke-width:2px
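The shape of this loop can be condensed into a few lines of Python. Here `model` is a stand-in for the dispatcher-plus-router path, mapping assembled context to a list of plan steps; everything else is illustrative:

```python
def planning_loop(request: str, model, max_iters: int = 3) -> list[str]:
    """Iterate: assemble context, request a plan, fold the plan back into
    context, and refine until the plan stops changing (or we hit the cap).
    """
    context = [request]
    plan: list[str] = []
    for _ in range(max_iters):
        new_plan = model("\n".join(context))   # dispatcher + router stand-in
        if new_plan == plan:                   # converged: no further refinement
            break
        plan = new_plan
        context.append("plan: " + " -> ".join(plan))  # plan re-enters the context
    return plan
```

The detail worth noticing is that the plan re-enters the context: the planner refines against its own prior output, which is why context assembly sits inside the loop rather than before it.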
The Execution Loop
The execution loop is the agent’s action cycle. Once the planner produces a plan, the executor takes over. It sends execution requests through the dispatcher to a model, which returns tool calls or sub-agent delegations. The dispatcher routes these to the appropriate tool or sub-agent, whose results flow back through the context manager — which assembles the updated context — and back to the executor for the next step. Once execution completes or runs into an issue, the executor passes the outcome back to the planner, which can continue planning or replan as needed until the task is fully resolved.
flowchart TB
Planner["Planner<br/>(task decomposition,<br/>step sequencing)"]
Executor["Executor<br/>(plan execution,<br/>step-by-step progression)"]
Context["Context Manager<br/>(prompt assembly, history<br/>compression, context scoping)"]
Dispatcher["Dispatcher<br/>(invokes model, returns<br/>response or dispatches)"]
Tools["Tool Invocation<br/>(APIs, databases, code<br/>execution, MCP)"]
SubAgent["Sub-Agent Invocation<br/>(delegated tasks with<br/>own context & planning)"]
Planner -->|"plan"| Executor
Executor -->|"execution<br/>request"| Dispatcher
Dispatcher -->|"tool calls"| Tools
Dispatcher -->|"delegated<br/>tasks"| SubAgent
Dispatcher -->|"execution<br/>outcome"| Context
Tools -->|"tool results"| Context
SubAgent -->|"sub-agent<br/>results"| Context
Context -->|"assembled<br/>context"| Executor
Executor -->|"execution<br/>outcomes"| Planner
classDef active fill:#dcfce7,stroke:#16a34a,color:#14532d,stroke-width:2px
class Planner,Executor,Context,Dispatcher,Tools,SubAgent active
linkStyle 0,1,2,3,4,5,6,7,8 stroke:#16a34a,stroke-width:2px
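A compressed sketch of this loop, collapsing the dispatcher into a dictionary lookup and treating each plan step as a tool name (both simplifications for illustration):

```python
def execution_loop(plan: list[str], tools: dict) -> dict:
    """Walk the plan step by step, folding each result into context for the
    next step, and return the outcome the executor passes to the planner.
    """
    context: list[str] = []
    for step in plan:
        tool = tools.get(step)
        if tool is None:  # can't execute this step: hand control back to the planner
            return {"status": "needs_replan", "failed_step": step,
                    "context": context}
        result = tool(context)               # each tool sees accumulated context
        context.append(f"{step}: {result}")  # result re-enters the context
    return {"status": "done", "context": context}
```

The two exit paths mirror the diagram: a clean completion flows back to the planner as an outcome, while an unexecutable step triggers replanning rather than a silent failure.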
About This Series
This series lays out the problem and offers a practical framework for agentic evaluation — one that accounts for the dimensions of complexity agents introduce, maps concrete metrics to each dimension, and acknowledges where the industry still doesn’t have good answers.
- No Country for Old Benchmarks — Series thesis: why agentic evals need a new paradigm.
- The Usual Suspects — A canonical architecture for agentic systems, identifying the components where things go wrong (this article).
- Mission Impossible — The eight dimensions of complexity that make agentic evaluation fundamentally different. (Coming Soon)
- Who Watches the Watchmen — LLM-as-a-judge, overall agent effectiveness, and retrieval quality metrics. (Coming Soon)
- Inception — Evaluating planning quality: structural correctness, effectiveness, robustness, and grounding. (Coming Soon)
- The Good, The Bad, and The Trajectory — Trajectory-level metrics that measure execution quality. (Coming Soon)
- The Matrix — Routing quality: how agents choose which tools, models, and sub-agents to invoke. (Coming Soon)
- Moneyball — Choosing the right metrics for your agent, plus resilience strategies for production. (Coming Soon)