No Country for Old Benchmarks
Traditional ML eval frameworks measure accuracy on static benchmarks. Agents operate in dynamic, multi-step environments with non-deterministic tool use, branching decisions, and environment-dependent outcomes. The industry needs a fundamentally new eval paradigm — one that measures trajectory quality, recovery from failure, cost-efficiency of tool use, and alignment with user intent across entire sessions, not individual outputs.
The Eval Gap Nobody Talks About
LLM-based agents have gone from research curiosity to production reality remarkably fast. Coding agents like Codex, Claude Code, and Gemini CLI now orchestrate multi-file edits and tool calls autonomously. Frameworks like LangGraph, CrewAI, AutoGen, and Google’s ADK provide opinionated scaffolding for building multi-step, tool-using agents. As Andrew Ng recently highlighted, we’re seeing a shift where agentic workflows — iterative loops of thinking and acting — are often more impactful than just scaling up the underlying model parameters.
This isn’t just changing products — it’s changing how we build software. Architectures are being rethought to accommodate agentic patterns: human-in-the-loop checkpoints, tool registries with dynamic dispatch, and context management layers that didn’t exist before. The Model Context Protocol (MCP) is rapidly becoming a standard for tool integration, effectively creating a universal interface for agents to interact with the world. If you’re an engineer at any scale, you’re either already building agents or you’re about to be.
And yet, there’s a conspicuous gap: we don’t have a good answer for how to evaluate these systems. The benchmarks we rely on — MMLU for general reasoning, SWE-Bench for code generation, HumanEval for function-level synthesis — measure how well a model handles a single, well-scoped input-output pair. They tell you something about raw model capability, but they’re misleading as proxies for agent performance. The temptation is to assume that if the underlying model evals look good, the agent evals will follow — after all, agents are just executing prompts, right?
This reasoning is partially true: agents build on LLM calls, and the quality of those calls matters. But an agent built on a model that aces MMLU can still lose coherent context five turns in. An agent backed by the top SWE-Bench model can still invoke the wrong tool half the time in your domain-specific workflow, pass stale context to a sub-agent, or fail to recover from an unexpected API error. The gap between “this model is generally capable” and “this agent reliably does what my users need” is where things break in production — and none of these failure modes are captured by evaluating prompts in isolation.
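To make the contrast concrete, here is a minimal sketch of the difference between scoring a single input-output pair and scoring a whole trajectory. All names here (`Step`, `trajectory_eval`, the example tools) are hypothetical illustrations, not the API of any particular framework:

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str                # which tool the agent invoked
    ok: bool                 # did the call succeed?
    recovered: bool = False  # for failed steps: did the agent later fix it?

def static_eval(output: str, expected: str) -> bool:
    # Classic benchmark scoring: one input, one output, one comparison.
    return output.strip() == expected.strip()

def trajectory_eval(steps: list[Step], allowed_tools: set[str]) -> dict:
    # Score properties of the whole session, not a single output.
    failures = [s for s in steps if not s.ok]
    return {
        # Did the agent pick tools appropriate to this workflow?
        "tool_choice_rate": sum(s.tool in allowed_tools for s in steps) / len(steps),
        # Of the steps that failed, how many did the agent recover from?
        "recovery_rate": sum(s.recovered for s in failures) / len(failures) if failures else 1.0,
        # Rough proxy for the cost-efficiency of tool use.
        "step_count": len(steps),
    }

session = [
    Step("search", ok=True),
    Step("deploy", ok=False, recovered=True),  # wrong tool; agent later recovered
    Step("edit_file", ok=True),
]
scores = trajectory_eval(session, allowed_tools={"search", "edit_file"})
```

A model could ace every `static_eval`-style benchmark and still produce sessions where `tool_choice_rate` and `recovery_rate` are poor; those numbers only exist at the trajectory level.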
What Agentic Evals Actually Add
To be clear: agentic evals don’t replace everything you already know about ML quality. All the standard concerns — model selection, prompt engineering, output quality, latency, cost — still apply. You can’t skip them, but you also can’t stop there and assume the agent will be fine. A mature agentic practice builds on that foundation and adds evaluation for the new failure modes: context degradation, tool misuse, planning breakdowns, coordination failures, and more. If you’re building agents in production, you need to measure these dimensions during development, monitor them in production, and practice resilience strategies that keep the agent functioning predictably when things go wrong — because they will.
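As one illustration of what such a resilience strategy might look like, here is a sketch of a retry-then-fallback wrapper that also records every attempt, so production monitoring can track failure and recovery rates per session. `call_with_fallback`, `live_search`, and `cached_search` are hypothetical names for this sketch, not a real library API:

```python
# Hypothetical resilience sketch: retry a tool call, then degrade to a
# fallback, logging each attempt for production monitoring.
def call_with_fallback(tool, fallback, args, retries=2):
    log = []
    for attempt in range(retries):
        try:
            result = tool(**args)
            log.append({"tool": tool.__name__, "attempt": attempt, "ok": True})
            return result, log
        except Exception as exc:
            log.append({"tool": tool.__name__, "attempt": attempt,
                        "ok": False, "error": str(exc)})
    # Primary tool exhausted its retries: fail over predictably.
    result = fallback(**args)
    log.append({"tool": fallback.__name__, "attempt": 0, "ok": True})
    return result, log

# Example: a flaky live tool backed by a cached fallback.
def live_search(query):
    raise TimeoutError("search API unavailable")

def cached_search(query):
    return f"cached results for {query!r}"

result, log = call_with_fallback(live_search, cached_search, {"query": "agent evals"})
```

The point is less the retry logic than the log: the same structure that keeps the agent functioning under failure is what feeds the monitoring you need in production.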
About This Series
This series lays out the problem and offers a practical framework for agentic evaluation — one that accounts for the dimensions of complexity agents introduce, maps concrete metrics to each dimension, and acknowledges where the industry still doesn’t have good answers.
- No Country for Old Benchmarks — Series thesis: why agentic evals need a new paradigm (this article).
- The Usual Suspects — A canonical architecture for agentic systems, identifying the components where things go wrong.
- Mission Impossible — The eight dimensions of complexity that make agentic evaluation fundamentally different. (Coming Soon)
- Who Watches the Watchmen — LLM-as-a-judge, overall agent effectiveness, and retrieval quality metrics. (Coming Soon)
- Inception — Evaluating planning quality: structural correctness, effectiveness, robustness, and grounding. (Coming Soon)
- The Good, The Bad, and The Trajectory — Trajectory-level metrics that measure execution quality. (Coming Soon)
- The Matrix — Routing quality: how agents choose which tools, models, and sub-agents to invoke. (Coming Soon)
- Moneyball — Choosing the right metrics for your agent, plus resilience strategies for production. (Coming Soon)