Why Agent Testing Is Fundamentally Harder Than LLM Testing
If you have been running evals on standalone LLM calls, you already have a head start. But the moment you graduate from a single prompt-response pair to a multi-step agent that selects tools, manages state across turns, and makes branching decisions, the testing surface area explodes. Most teams discover this the hard way: their LLM evals all pass, but their agent still fails catastrophically in production because the failures happen between the LLM calls, not inside them.
Here is what makes agent testing genuinely different. First, agents are non-deterministic at multiple levels. The LLM itself produces variable outputs, but the agent also makes variable decisions about which tools to call, in what order, and with what parameters. A single user request can produce ten different valid execution traces. Your tests need to accommodate that variability without becoming so loose they catch nothing.
Second, agents have state. They accumulate context across steps, store intermediate results, update working memory, and sometimes persist state across sessions. A bug in step three might only manifest because of a subtle state corruption in step one. Traditional unit tests that check a single function in isolation miss these cascading failures entirely.
Third, tool use introduces real side effects. When your agent calls an API, writes to a database, or sends an email, those actions have consequences that are expensive or impossible to reverse. You cannot just rerun a failing test and hope for the best. You need sandbox environments, mock tools, and careful isolation strategies.
Fourth, multi-agent systems add coordination complexity. When Agent A delegates a subtask to Agent B, you now have two non-deterministic systems communicating through natural language. The failure modes multiply: miscommunication, deadlocks, conflicting actions, infinite loops, and subtle semantic drift where each agent gradually misinterprets what the other meant. Testing this requires a fundamentally different approach than testing a single agent in isolation.
The teams that ship reliable agents treat testing as a first-class engineering discipline, not an afterthought bolted onto a demo. This guide walks through exactly how to build that discipline from the ground up.
The Four Layers of Agent Testing
You need four distinct layers of tests for a production agent system. Skipping any one of them leaves a blind spot that will eventually bite you. Think of these layers like the classic testing pyramid, adapted for the unique challenges of agentic AI.
Unit Tests for Individual Tools
Every tool your agent can invoke needs its own test suite, independent of the agent. If you have a database query tool, test that it handles malformed queries gracefully, respects permission boundaries, returns results in the expected schema, and fails cleanly when the database is unavailable. If you have a web search tool, test that it parses results correctly, handles rate limits, and sanitizes inputs. These tests are fast, deterministic, and cheap to run. They should live in your standard test suite and run on every commit.
The mistake I see most often is teams testing tools only through the agent. That means every tool test also depends on the LLM generating the right tool call, which makes failures ambiguous. Did the tool break, or did the LLM call it wrong? Isolate the layers.
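As a minimal sketch of what testing a tool without the agent in the loop can look like, the example below uses a hypothetical `query_tool` backed by an in-memory SQLite database. The tool name, the result schema, and the crude "SELECT only" permission check are assumptions for illustration, not a recommended production design.

```python
import sqlite3

def query_tool(conn: sqlite3.Connection, sql: str) -> dict:
    """Hypothetical read-only query tool that always returns the same schema."""
    # Crude permission boundary for illustration: reject anything that is not a SELECT.
    if not sql.strip().lower().startswith("select"):
        return {"ok": False, "error": "only SELECT statements are allowed", "rows": []}
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:  # malformed SQL, closed connection, and similar
        return {"ok": False, "error": str(exc), "rows": []}
    return {"ok": True, "error": None, "rows": [list(row) for row in rows]}

def make_test_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
    conn.execute("INSERT INTO orders VALUES (1, 9.5)")
    return conn

def test_returns_expected_schema():
    result = query_tool(make_test_db(), "SELECT id, total FROM orders")
    assert result == {"ok": True, "error": None, "rows": [[1, 9.5]]}

def test_malformed_query_fails_cleanly():
    result = query_tool(make_test_db(), "SELECT FROM WHERE")
    assert result["ok"] is False and result["rows"] == []

def test_write_is_rejected():
    result = query_tool(make_test_db(), "DROP TABLE orders")
    assert result["ok"] is False

def test_unavailable_database_fails_cleanly():
    conn = make_test_db()
    conn.close()  # simulate the database going away
    result = query_tool(conn, "SELECT id FROM orders")
    assert result["ok"] is False and result["error"]
```

None of these tests involve an LLM, so a failure here points at the tool and nothing else.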
Integration Tests for Tool Chains
Once individual tools work correctly, you need to test them in combination. An integration test might verify that your agent can successfully search for a document, extract relevant information, and compose an email with that information. The key question is whether data flows correctly between tools and whether the agent handles intermediate failures (what happens when the search returns no results, or the email service is down?).
For integration tests, I recommend using a fixed LLM response sequence rather than a live model. Record a known-good execution trace, then replay the LLM responses while running the real tool chain. This gives you determinism in the LLM layer while still exercising real tool integration. When these tests break, you know the issue is in your tool infrastructure, not in model behavior.
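The record-and-replay idea can be sketched roughly as follows. `ReplayLLM`, the toy `run_agent` loop, the action format, and the recorded trace are all hypothetical stand-ins; a real agent loop and recording format will look different.

```python
import json
from dataclasses import dataclass, field

@dataclass
class ReplayLLM:
    """Stands in for a live model by replaying recorded responses in order."""
    responses: list
    calls: list = field(default_factory=list)

    def complete(self, prompt: str) -> dict:
        if len(self.calls) >= len(self.responses):
            raise RuntimeError("replay exhausted: agent made more LLM calls than recorded")
        self.calls.append(prompt)
        return self.responses[len(self.calls) - 1]

def run_agent(llm, tools: dict, request: str, max_steps: int = 10) -> str:
    """Toy agent loop: each LLM response either calls a tool or finishes."""
    context = request
    for _ in range(max_steps):
        action = llm.complete(context)
        if action["type"] == "final":
            return action["text"]
        result = tools[action["tool"]](**action["args"])  # the real tool runs here
        context += "\n" + json.dumps(result)
    raise RuntimeError("agent did not finish within max_steps")

def search(query: str) -> dict:
    """Placeholder for a real tool implementation under test."""
    return {"query": query, "hits": ["doc-1"]}

# A made-up known-good trace, recorded once and replayed on every test run.
RECORDED = [
    {"type": "tool", "tool": "search", "args": {"query": "onboarding checklist"}},
    {"type": "final", "text": "Found 1 document."},
]
```

Because the LLM layer is frozen, a failure in this kind of test can only come from the tool chain or the loop that wires it together.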
End-to-End Scenario Tests
Scenario tests exercise the full agent pipeline with a live model. You give the agent a realistic user request and verify that the final outcome is correct. These are the most valuable tests for catching real production failures, and also the most expensive to run. Each test invokes one or more LLM calls, which cost money and take time.
Build your scenario suite around your golden dataset: a curated set of user requests paired with expected outcomes. For agents, the expected outcome is usually not an exact string match. Instead, you verify properties: Did the agent complete the task? Did it call the right tools? Did it produce a result that satisfies a rubric? Did it stay within cost and latency budgets? We cover golden dataset construction in our LLM evaluation guide, and the same principles apply to agents with the added complexity of multi-step verification.
Regression Tests
Every production failure should become a regression test. When you discover that your agent mishandles a specific edge case (say, a user request with ambiguous tool selection), you add that exact scenario to your test suite with the correct expected behavior. Over time, this regression suite becomes your most valuable testing asset because it encodes every hard lesson you have learned. Tag regression tests with the incident that spawned them so you can trace why each test exists.
Evaluation Frameworks: What to Use in 2026
The evaluation tooling landscape has matured significantly, but no single tool covers everything you need for agent testing. Here is my honest, opinionated breakdown of the major options and how they fit together.
Braintrust
Braintrust remains the most polished general-purpose eval platform. For agent testing specifically, its experiment tracking is excellent. You can log full execution traces (every LLM call, tool invocation, and intermediate state), define custom scoring functions that evaluate multi-step behavior, and compare experiments side by side. The SDK makes it straightforward to wrap your agent's execution loop and capture structured traces. Where Braintrust falls short is in agent-specific primitives. You still need to write your own logic for evaluating tool selection accuracy or step efficiency. But as a foundation for running and tracking evals, it is the strongest option.
Langfuse
Langfuse is my go-to recommendation for teams that want observability and evaluation in a single stack, especially if you need to self-host. Its tracing model maps naturally to agent execution: you create a trace for the overall task, nest spans for each agent step, and attach scores at any level. The dataset and annotation features let you build golden datasets directly from production traces, which is the fastest way to bootstrap an agent eval suite. The open-source nature means you can extend it when the built-in scoring does not cover your agent-specific metrics.
Arize Phoenix
Phoenix has carved out a strong niche for trace-level debugging and evaluation. Its UI for exploring agent execution traces is genuinely best-in-class. You can visualize the full decision tree your agent took, inspect tool call parameters at each step, and identify where the agent diverged from the expected path. For teams that are debugging agent failures more than running automated eval suites, Phoenix is the right starting point. It also has solid LLM-as-judge integration for scoring individual steps within a trace.
Patronus AI
Patronus focuses on automated quality evaluation with pre-built judges for hallucination detection, toxicity, PII leakage, and factual consistency. For agent testing, its value is in output validation: after your agent completes a task, Patronus can automatically verify that the final output meets quality standards without you writing custom judge prompts from scratch. The pre-built evaluators save significant time if your quality concerns align with their catalog. For highly custom agent behaviors, you will still need supplementary evaluation logic.
Custom Eval Harnesses
For serious agent testing, you will inevitably need some custom evaluation code. No off-the-shelf tool knows how to score whether your specific agent selected the optimal tool sequence for a given task, or whether it managed state correctly across a 12-step workflow. My recommendation is to use one of the platforms above as your infrastructure layer (trace storage, experiment tracking, dashboards) and build your scoring logic in Python or TypeScript. A typical custom harness includes a test runner that executes agent scenarios, a set of scoring functions that evaluate traces against rubrics, and a reporter that pushes results to your eval platform. Keep it simple. A 200-line Python script that scores five key dimensions is worth more than an elaborate framework nobody maintains.
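To make the three pieces concrete, here is a deliberately small sketch of a runner, two scorers, and a reporter. The trace and scenario fields (`completed`, `tools`, `expected_tools`) are assumed shapes for illustration, and the reporter only averages scores where a real one would push them to your eval platform.

```python
from typing import Callable

def run_suite(agent: Callable[[str], dict], scenarios: list, scorers: dict) -> list:
    """Test runner: execute each scenario and score the resulting trace."""
    results = []
    for scenario in scenarios:
        trace = agent(scenario["request"])
        scores = {name: score(scenario, trace) for name, score in scorers.items()}
        results.append({"id": scenario["id"], "scores": scores})
    return results

def score_completed(scenario: dict, trace: dict) -> float:
    return 1.0 if trace["completed"] else 0.0

def score_tool_sequence(scenario: dict, trace: dict) -> float:
    return 1.0 if trace["tools"] == scenario["expected_tools"] else 0.0

def report(results: list) -> dict:
    """Reporter: average each scoring dimension across the suite."""
    if not results:
        return {}
    names = results[0]["scores"]
    return {n: sum(r["scores"][n] for r in results) / len(results) for n in names}

SCORERS = {"completed": score_completed, "tool_sequence": score_tool_sequence}
```

Adding a new dimension means adding one function and one dictionary entry, which is about as much structure as a harness like this needs early on.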
My default stack for agent evaluation in 2026: Langfuse for tracing and production monitoring, Braintrust for experiment tracking during development, custom Python scorers for agent-specific metrics, and Promptfoo in CI for fast regression checks.
Building Test Datasets and Choosing Metrics That Matter
Your evaluation is only as good as your test data. For agents, building a useful dataset requires more thought than for simple LLM tasks because you need to capture not just input-output pairs but expected behaviors across multi-step execution.
Golden Dataset Construction
A golden dataset for agent testing should include three components for each test case: the user request, the expected final outcome (with a rubric, not an exact match), and optionally the expected execution trace or constraints on the trace. For example, a test case for a research agent might specify: the user asks about Q2 revenue trends, the final report should mention at least three specific data points, the agent should use the database query tool before the summarization tool, and the total execution should complete in under 30 seconds.
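The research-agent example above could be encoded along these lines. The `GoldenCase` fields and the tool names are hypothetical, and counting data points in the final report is assumed to happen elsewhere (for instance in a rubric judge).

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    request: str
    min_data_points: int   # rubric constraint on the final report
    tool_order: list       # tools that must appear in this relative order
    max_seconds: float     # latency budget for the whole run

def in_order(called: list, expected: list) -> bool:
    """True if `expected` is a subsequence of `called` (other tools may interleave)."""
    remaining = iter(called)
    return all(tool in remaining for tool in expected)

def evaluate(case: GoldenCase, called_tools: list, data_points: int, elapsed_s: float) -> dict:
    return {
        "enough_data_points": data_points >= case.min_data_points,
        "tool_order_respected": in_order(called_tools, case.tool_order),
        "within_time_budget": elapsed_s < case.max_seconds,
    }

case = GoldenCase(
    request="What were the Q2 revenue trends?",
    min_data_points=3,
    tool_order=["database_query", "summarize"],
    max_seconds=30.0,
)
```

Checking a subsequence rather than an exact tool list keeps the test tolerant of valid variation while still enforcing the ordering constraint that matters.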
Start with 50 to 100 hand-curated scenarios covering your core use cases. Stratify by complexity (simple single-tool tasks, multi-tool chains, ambiguous requests requiring clarification) and by risk level (tasks where failure is merely inconvenient versus tasks where failure has financial or safety consequences). Mine production logs aggressively. Real user requests are always stranger and more diverse than what your team invents in a brainstorming session.
Adversarial and Edge Cases
Agents face adversarial inputs that go beyond standard prompt injection. Include test cases where the user provides contradictory instructions, where tool responses contain unexpected formats, where tools fail mid-execution, where the user changes their mind partway through, and where the optimal action is to refuse the request entirely. Also include cases that probe the boundaries of your agent's capabilities. If your agent is not supposed to send emails, include a test case where the user explicitly asks it to send an email, and verify it declines gracefully.
Metrics That Actually Drive Decisions
Most teams track too many metrics and act on none of them. For agent evaluation, focus on these five:
- Task completion rate. What percentage of user requests does the agent resolve successfully? This is your north star. Measure it weekly and treat any sustained drop as an incident.
- Tool selection accuracy. When the agent picks a tool, was it the right tool for the situation? Measure this by comparing the agent's tool selections against your golden dataset's expected traces. Low tool selection accuracy usually means your tool descriptions are ambiguous or your agent's planning prompt needs work.
- Step efficiency. How many steps does the agent take compared to the optimal path? An agent that reaches the right answer but takes 15 steps instead of 4 is wasting money and user patience. Track the ratio of actual steps to optimal steps.
- Cost per task. Sum up all LLM inference costs, tool execution costs, and any external API charges for each task. Set a budget ceiling per task type and alert when the agent exceeds it. Runaway costs are a common failure mode when agents get stuck in loops.
- End-to-end latency. How long does the user wait? Agents are inherently slower than single LLM calls because of sequential tool execution. Track P50, P95, and P99 latency. If P95 exceeds your users' patience threshold, you need to optimize the agent's planning or parallelize tool calls.
Every other metric is secondary. Track them if you want, but make decisions based on these five.
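A rough sketch of computing the five metrics from a batch of run records follows. The record fields are assumptions, step efficiency is aggregated as total actual steps over total optimal steps (averaging per-task ratios is an equally defensible choice), and the percentile uses the simple nearest-rank method.

```python
import math

def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile; assumes `values` is non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize_runs(runs: list) -> dict:
    """Aggregate the five core metrics; assumes a non-empty list with tool calls."""
    n = len(runs)
    latencies = [r["latency_s"] for r in runs]
    return {
        "task_completion_rate": sum(1 for r in runs if r["completed"]) / n,
        "tool_selection_accuracy": (
            sum(r["correct_tool_calls"] for r in runs) / sum(r["tool_calls"] for r in runs)
        ),
        # 1.0 means the agent matched the optimal path; higher means wasted steps.
        "step_efficiency": sum(r["steps"] for r in runs) / sum(r["optimal_steps"] for r in runs),
        "cost_per_task_usd": sum(r["cost_usd"] for r in runs) / n,
        "latency_p50_s": percentile(latencies, 50),
        "latency_p95_s": percentile(latencies, 95),
        "latency_p99_s": percentile(latencies, 99),
    }
```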
Testing Tool Use: Mocks, Sandboxes, and Multi-Agent Coordination
How you handle tool execution in tests is one of the most consequential design decisions in your testing strategy. Get it wrong and you end up with tests that are either unrealistically easy to pass (because mocks are too forgiving) or prohibitively expensive and flaky (because real tools introduce external dependencies).
Mock Tools vs Real Tools
For unit and integration tests, mock tools are the right default. A mock tool returns a predetermined response for a given input, removing external dependencies and making tests deterministic. But your mocks need to be realistic. A mock database tool that always returns perfectly formatted results teaches your agent nothing about handling empty result sets, malformed schemas, or connection timeouts. Build your mocks with configurable failure modes. For each mock tool, support at least four scenarios: success with typical data, success with edge-case data, transient failure (retryable), and permanent failure.
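One way to build a mock with the four scenarios is sketched below. The class, the exception types, and the canned rows are all hypothetical; the point is only that the failure mode is a constructor argument rather than something baked into the mock.

```python
from enum import Enum

class Mode(Enum):
    SUCCESS = "success"
    EDGE_CASE = "edge_case"
    TRANSIENT_FAILURE = "transient_failure"
    PERMANENT_FAILURE = "permanent_failure"

class TransientToolError(Exception):
    """Retryable failure, such as a connection timeout."""

class PermanentToolError(Exception):
    """Non-retryable failure, such as a permission error."""

class MockDatabaseTool:
    def __init__(self, mode: Mode = Mode.SUCCESS, fail_times: int = 1):
        self.mode = mode
        self.remaining_failures = fail_times
        self.calls = []  # recorded so tests can assert on what the agent sent

    def query(self, sql: str) -> list:
        self.calls.append(sql)
        if self.mode is Mode.PERMANENT_FAILURE:
            raise PermanentToolError("permission denied")
        if self.mode is Mode.TRANSIENT_FAILURE and self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TransientToolError("connection timed out")
        if self.mode is Mode.EDGE_CASE:
            return []  # empty result set
        return [{"id": 1, "total": 9.5}]
```

In transient mode the mock fails `fail_times` times and then succeeds, which lets a test verify that the agent actually retries instead of giving up or looping forever.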
For end-to-end scenario tests, use real tools in a sandbox environment. Spin up a dedicated test database with synthetic data, point your API tools at staging endpoints, and configure file system tools to use isolated temporary directories. The sandbox should be reset between test runs to prevent state leakage. Docker Compose works well for this. Define a test profile that starts all your tool dependencies with test data pre-loaded.
Sandbox Environment Design
Your sandbox needs to satisfy three properties: isolation (tests cannot affect production systems or each other), reproducibility (the same test produces the same result when external factors are controlled), and fidelity (the sandbox behaves enough like production that passing tests are meaningful). The hardest part is fidelity. If your production agent queries a Postgres database with 10 million rows and your test database has 50 rows, the agent might succeed in testing but fail in production because the LLM generates queries that are too slow on real data. Invest in realistic synthetic data. Libraries like Faker can generate realistic-looking records at production scale, and synthetic data platforms like Gretel can model your real data distribution, all without exposing actual customer information.
Testing Multi-Agent Systems
Multi-agent coordination failures are some of the hardest bugs to reproduce and the most damaging in production. When Agent A delegates a subtask to Agent B, you need to test several failure scenarios: Agent B returns an incorrect result, Agent B times out, Agent B partially completes the task and then fails, Agent B succeeds but returns results in an unexpected format, and Agent A and Agent B enter an infinite delegation loop.
The most effective testing pattern for multi-agent systems is contract testing. Define explicit contracts for inter-agent communication: what request format does Agent B expect, what response schema does it guarantee, what error codes can it return? Test each agent independently against these contracts, then run integration tests that exercise the actual communication channel. This mirrors how microservices teams test API contracts, and for good reason. Multi-agent systems have the same coordination challenges as distributed microservices, with the added unpredictability of LLM-generated messages. For more on building reliable guardrails around these interactions, see our guide on AI guardrails implementation.
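A contract for inter-agent messages can start as a pair of plain validation functions, as in this sketch. The field names, the error codes, and the `hops` counter used to cap delegation depth are assumptions made up for the example; a schema library would do the same job with less code.

```python
MAX_HOPS = 3  # hypothetical cap on delegation depth, to break A -> B -> A loops
ERROR_CODES = {"TIMEOUT", "INVALID_REQUEST", "PARTIAL_FAILURE"}

def request_violations(msg: dict) -> list:
    """Return contract violations for a delegation request; empty means valid."""
    problems = []
    if not isinstance(msg.get("task"), str) or not msg.get("task"):
        problems.append("task must be a non-empty string")
    hops = msg.get("hops")
    if not isinstance(hops, int) or hops < 0:
        problems.append("hops must be a non-negative integer")
    elif hops > MAX_HOPS:
        problems.append("delegation depth exceeded")
    return problems

def response_violations(msg: dict) -> list:
    """Return contract violations for a delegation response; empty means valid."""
    problems = []
    status = msg.get("status")
    if status not in {"ok", "error"}:
        problems.append("status must be 'ok' or 'error'")
    if status == "ok" and not isinstance(msg.get("result"), dict):
        problems.append("ok responses must carry a dict result")
    if status == "error" and msg.get("error_code") not in ERROR_CODES:
        problems.append("error responses must use a known error_code")
    return problems
```

Each agent's own test suite can assert that everything it emits passes these checks, and the integration tests can run the same functions on messages captured from the real channel.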
CI/CD Integration and Regression Detection
Evals that do not block bad code from merging are decoration. The entire point of investing in agent testing is to catch regressions before they reach production. Here is how to wire your eval suite into CI/CD so it actually prevents failures.
Running Evals on Every PR
Use path-based triggers to run the right level of testing for each change. If a PR modifies tool implementations, run tool unit tests and integration tests. If it changes agent prompts, planning logic, or model configuration, run the full scenario suite. If it touches unrelated frontend code, skip agent evals entirely. This tiering keeps CI fast for most changes while ensuring thorough testing when it matters.
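Most CI systems express path filters in their own configuration, but the selection logic itself is small enough to sketch in Python. The directory layout and suite names below are hypothetical.

```python
from fnmatch import fnmatchcase

# (glob pattern, suites to run) pairs; paths and suite names are made up.
RULES = [
    ("tools/*", {"tool_unit", "tool_integration"}),
    ("agent/prompts/*", {"scenario_full"}),
    ("agent/planner/*", {"scenario_full"}),
    ("config/models.yaml", {"scenario_full"}),
]

def suites_for(changed_paths: list) -> set:
    """Map the files changed in a PR to the eval suites that should run."""
    selected = set()
    for path in changed_paths:
        for pattern, suites in RULES:
            # fnmatchcase's `*` also matches `/`, so nested files are covered.
            if fnmatchcase(path, pattern):
                selected |= suites
    return selected
```

An empty result means the change touched nothing the agent depends on, so the agent evals can be skipped for that PR.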
For the scenario suite, cost is a real concern. A 100-scenario eval suite using GPT-4o or Claude Opus might cost 5 to 15 dollars per run. That adds up if you are running it on every push to a feature branch. Two strategies to manage this: first, run a fast "smoke test" subset (10 to 20 high-priority scenarios) on every push, and the full suite only on PRs targeting main. Second, cache aggressively. If a test case's inputs and the model version have not changed, reuse the cached result. Promptfoo handles caching well out of the box.
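If you end up rolling your own cache rather than relying on a tool's built-in one, the core of it is a stable key over everything that can change the result. This sketch keys on the scenario and the model version only; in practice you would likely fold the prompt and tool versions into the key as well. The function names and the in-memory dictionary are assumptions for illustration.

```python
import hashlib
import json

def cache_key(scenario: dict, model_version: str) -> str:
    """Stable hash of the test inputs and model version."""
    payload = json.dumps({"scenario": scenario, "model": model_version}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def run_cached(scenario: dict, model_version: str, run_fn, cache: dict):
    """Run the scenario only if no result is cached for this exact key."""
    key = cache_key(scenario, model_version)
    if key not in cache:
        cache[key] = run_fn(scenario)
    return cache[key]
```

Sorting keys before hashing means two scenarios with the same content but different dictionary ordering share a cache entry.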
Regression Detection
Your CI pipeline should compare eval results against a baseline (typically the scores from the current main branch). Define clear regression thresholds: block the merge if task completion rate drops by more than 2 percentage points, if any previously passing high-severity test case starts failing, or if average cost per task increases by more than 20 percent. Post a detailed comparison comment on the PR showing which scenarios changed, with the actual agent outputs for any regressions. Reviewers should see the behavior difference, not just a number.
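The three thresholds above translate into a small gate function. This is a sketch under assumed result shapes: each summary carries a completion rate between 0 and 1, the set of high-severity test IDs that passed, and an average cost per task.

```python
def regression_reasons(baseline: dict, candidate: dict) -> list:
    """Return human-readable reasons to block the merge; empty means pass."""
    reasons = []

    drop_pts = (baseline["completion_rate"] - candidate["completion_rate"]) * 100
    if drop_pts > 2.0:
        reasons.append(f"task completion rate dropped {drop_pts:.1f} points")

    newly_failing = sorted(
        baseline["passing_high_severity"] - candidate["passing_high_severity"]
    )
    if newly_failing:
        reasons.append("high-severity cases now failing: " + ", ".join(newly_failing))

    # Assumes a non-zero baseline cost.
    increase = (candidate["avg_cost_usd"] - baseline["avg_cost_usd"]) / baseline["avg_cost_usd"]
    if increase > 0.20:
        reasons.append(f"average cost per task up {increase:.0%}")

    return reasons
```

Returning reasons instead of a boolean makes it easy to paste the output straight into the PR comment.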
Practical CI Configuration
A GitHub Actions workflow for agent evals typically follows this structure: check out the code, spin up the sandbox environment (Docker Compose), install dependencies, run the eval harness against the sandbox, compare results to the baseline stored as a CI artifact, and post results as a PR comment. The entire flow should complete in under 10 minutes for the smoke suite. If your full suite takes longer than 30 minutes, you need to either reduce the number of scenarios (focus on the highest-value cases) or parallelize execution across multiple runners.
Store eval baselines as versioned artifacts in your CI system, not in the repository. This prevents merge conflicts when multiple PRs update baselines simultaneously. When a PR intentionally changes agent behavior (improving a prompt, switching models), the author should explicitly update the baseline with a justification in the PR description.
A/B Testing, Human Evaluation, and Monitoring Agent Drift
Your offline eval suite catches known failure modes, but production always surprises you. The final layer of your testing strategy covers live traffic: A/B testing new agent versions, incorporating human judgment, and detecting when agent quality degrades over time.
A/B Testing Agents in Production
When you are considering a significant change to your agent (new model, restructured planning prompt, additional tools), run it as an A/B test before a full rollout. Route a small percentage of traffic (5 to 10 percent) to the new version and compare metrics side by side. The tricky part with agent A/B tests is that agents have side effects. If the B variant sends a poorly written email to a real customer, you cannot undo that. Use progressive rollout with guardrails: start at 1 percent traffic with human review of every B-variant interaction, increase to 5 percent with automated quality checks, and only go to 50/50 once you have statistical confidence in the key metrics.
The metrics you compare in an A/B test should match your core five (task completion, tool accuracy, step efficiency, cost, latency), plus any domain-specific quality measures. Run the test long enough to get statistical significance. For agents with lower traffic volume, this might mean two to four weeks. Do not cut tests short because early results look promising. Agent failures are often long-tail events that take time to surface.
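For a rate metric such as task completion, one standard way to check significance is a two-proportion z-test. The article does not prescribe a specific test, so treat this as one reasonable option, sketched with the standard library only.

```python
import math

def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Two-sided z-test for a difference in success rates between variants A and B.

    Assumes both samples are non-empty and the pooled rate is strictly
    between 0 and 1 (otherwise the standard error is zero).
    """
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value
```

The same three-point lift in completion rate (90 percent versus 93 percent) is not significant at the 0.05 level with 100 tasks per arm, but is with 1,000 per arm, which is why low-traffic agents need longer tests.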
Human Evaluation Loops
LLM-as-judge scoring is powerful but imperfect. For agent tasks that involve nuanced quality judgments (was the agent's research thorough enough? was its communication tone appropriate?), you need periodic human evaluation. Set up a review queue that samples 5 to 10 percent of production agent interactions and routes them to domain experts for scoring. Use the same rubric dimensions as your automated judges so you can calibrate the judges against human scores over time.
The human evaluation loop also serves as your canary for evaluation drift. If your LLM judge starts disagreeing with humans on more than 20 percent of sampled cases, something has changed: maybe the model powering your judge was updated, maybe user behavior shifted, or maybe your rubric no longer captures what quality means for your use case. Either way, human evaluators catch this before your automated metrics become meaningless.
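Tracking that disagreement rate takes only a few lines. In this sketch, a "disagreement" is any sampled case where the human and judge scores differ by more than a tolerance; how you define disagreement on your own rubric scale is a judgment call, and the function names are hypothetical.

```python
def disagreement_rate(pairs: list, tolerance: float = 0.0) -> float:
    """Share of (human_score, judge_score) pairs that differ by more than `tolerance`.

    Assumes `pairs` is non-empty.
    """
    disagreements = sum(1 for human, judge in pairs if abs(human - judge) > tolerance)
    return disagreements / len(pairs)

def judge_needs_recalibration(pairs: list, threshold: float = 0.20, tolerance: float = 0.0) -> bool:
    return disagreement_rate(pairs, tolerance) > threshold
```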
Monitoring Agent Drift
Agent behavior changes even when you do not touch the code. Model providers update weights, external APIs change response formats, user behavior evolves, and the data your agent retrieves from knowledge bases gets stale or corrupted. You need continuous monitoring to catch these silent regressions. Track your core metrics daily and set up alerts for sustained deviations. A single bad day might be noise. Three consecutive days of declining task completion rate is a signal.
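The "three consecutive days" rule is simple to automate. This sketch interprets it as three day-over-day decreases in a row, which needs four daily values; other readings (for example, three days below a rolling baseline) are equally valid.

```python
def sustained_decline(daily_values: list, days: int = 3) -> bool:
    """True if the metric fell on each of the last `days` day-over-day comparisons."""
    if len(daily_values) < days + 1:
        return False  # not enough history to decide
    recent = daily_values[-(days + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))
```

A single dip followed by a recovery does not trigger the alert, which is the noise-filtering behavior described above.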
Build a drift detection dashboard that overlays agent metrics with external events: model version changes from your provider, deployments, knowledge base updates, and traffic pattern shifts. When metrics drop, this overlay helps you quickly identify the cause. We walk through the observability infrastructure for this in our AI observability in production guide. The combination of automated evals, human review, and continuous monitoring creates a feedback loop where every production failure improves your test suite, and every test suite improvement prevents future production failures.
The Cost of Evaluation
Let me be honest about the cost dimension, because it surprises most teams. Running LLM-as-judge evaluations on every agent interaction is not cheap. If your agent handles 10,000 tasks per day and you judge each one with a 5-dimension rubric using Claude Sonnet, that is roughly 50,000 judge calls per day. At current pricing that is 50 to 150 dollars daily just for evaluation, on top of your actual agent inference costs. Scale to 100,000 tasks and you are looking at 500 to 1,500 dollars a day for eval alone.
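The arithmetic is worth writing down so you can plug in your own numbers. The per-call prices below are not quoted rates; they are simply what the 50 to 150 dollar range implies when spread over 50,000 judge calls (0.001 to 0.003 dollars each).

```python
def daily_judge_cost(tasks_per_day: int, rubric_dimensions: int, cost_per_call_usd: float) -> float:
    """One judge call per rubric dimension per task."""
    return tasks_per_day * rubric_dimensions * cost_per_call_usd

# Implied by the figures above: 50 / 50,000 and 150 / 50,000 dollars per call.
LOW_PER_CALL, HIGH_PER_CALL = 0.001, 0.003

for tasks in (10_000, 100_000):
    low = daily_judge_cost(tasks, 5, LOW_PER_CALL)
    high = daily_judge_cost(tasks, 5, HIGH_PER_CALL)
    print(f"{tasks:>7,} tasks/day: {low:,.0f} to {high:,.0f} dollars/day")
```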
The way to manage this is sampling. You do not need to judge every interaction. Sample 5 to 10 percent of production traffic for online evaluation, stratified by task type and complexity. Judge 100 percent of new or high-risk task categories until you have confidence in the agent's behavior. For CI evals, keep your golden dataset lean. 100 to 200 well-chosen scenarios with aggressive caching will catch most regressions without blowing your budget. The teams that get evaluation costs under control treat eval budget as a line item in their AI infrastructure spend, planned and tracked just like inference costs.
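A stratified sampler along these lines keeps rare task types represented while judging high-risk categories in full. The interaction fields, the `always_judge` parameter, and the choice to take at least one item per stratum are assumptions for this sketch.

```python
import random
from collections import defaultdict

def stratified_sample(interactions: list, rate: float, always_judge=frozenset(), seed: int = 0) -> list:
    """Sample `rate` of each (task_type, complexity) stratum; judge some types in full."""
    rng = random.Random(seed)  # seeded so a given day's sample is reproducible
    by_stratum = defaultdict(list)
    for item in interactions:
        by_stratum[(item["task_type"], item["complexity"])].append(item)

    sampled = []
    for (task_type, _complexity), items in by_stratum.items():
        if task_type in always_judge:
            sampled.extend(items)          # new or high-risk: judge 100 percent
        else:
            k = max(1, round(len(items) * rate))  # never drop a stratum entirely
            sampled.extend(rng.sample(items, k))
    return sampled
```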
If you are building AI agents and want help setting up a testing and evaluation pipeline that actually catches failures before your customers do, book a free strategy call with our team. We have built eval systems for agents across dozens of production deployments, and we can get you to confidence faster than building from scratch.