Behavioral Testing for Non-Deterministic AI Agents

Agent outputs are non-deterministic. Test decision quality and end-state, not exact execution paths, and define acceptable behavioral variance as a product decision.

Why Traditional Testing Breaks Down

Traditional tests assert exact outputs for given inputs. Agents produce different valid outputs for identical inputs: different tool call sequences, phrasings, and solution paths. Equality checks then fail runs that behaved correctly and pass runs that merely happened to match the expected string.

Behavioral testing replaces "did the agent produce output X?" with "did the agent make good decisions and reach a valid end-state?" [Source: Demystifying Evals for AI Agents]

Why It Works

Agents solve problems through search — selecting tools, observing results, updating plans. One task admits many valid paths because the solution space is under-constrained. End-state evaluation removes the path constraint: if the final state meets acceptance criteria, the agent succeeded regardless of route. [Source: Demystifying Evals for AI Agents]
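
To make the contrast concrete, here is a minimal sketch of path-based versus end-state assertions. The `Run` record, the tool names, and the `end_state_ok` criterion are hypothetical, invented for illustration; they are not taken from the cited source.

```python
from dataclasses import dataclass

@dataclass
class Run:
    """One agent run: the tools it called and the state it left behind."""
    tool_calls: list[str]
    final_state: dict[str, str]

def end_state_ok(run: Run) -> bool:
    # Hypothetical acceptance criterion: the config file ends up with this value.
    return run.final_state.get("config.yaml") == "timeout: 30"

# Two runs of the same task that take different routes to the same end-state.
run_a = Run(["read_file", "edit_file"], {"config.yaml": "timeout: 30"})
run_b = Run(
    ["search", "read_file", "edit_file", "read_file"],
    {"config.yaml": "timeout: 30"},
)

# A path-equality assertion rejects run_b even though it succeeded.
expected_path = ["read_file", "edit_file"]
path_verdicts = [run.tool_calls == expected_path for run in (run_a, run_b)]

# An end-state assertion accepts both.
state_verdicts = [end_state_ok(run) for run in (run_a, run_b)]

print(path_verdicts)   # [True, False]
print(state_verdicts)  # [True, True]
```

The second run verifies its own edit with an extra read, which is arguably better behavior, yet only the end-state check gives it credit.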

Separate Deterministic from Agentic Components

Not every part of an agent system needs behavioral testing. A capability matrix isolates what to test and how:

Component type	Testing method	Example
Deterministic	Traditional unit/integration tests	Tool input parsing, output formatting, API call construction
Agentic	Behavioral evaluation	Decision-making, tool selection, multi-step reasoning

Mock tools to test agent reasoning without external dependencies. Tool output quality — concise, filtered, well-formatted — also needs evaluation: tool responses shape the context the agent reasons over downstream.
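
One way to mock tools is a small callable that returns canned output and records how it was invoked. In the sketch below, `MockTool`, `stub_agent`, and the tool names are all assumptions made for illustration; `stub_agent` stands in for a real agent loop, which would decide which tool to call.

```python
from typing import Callable

class MockTool:
    """Stands in for an external tool: returns canned output and records calls."""

    def __init__(self, name: str, canned_output: str):
        self.name = name
        self.canned_output = canned_output
        self.calls: list[dict] = []

    def __call__(self, **kwargs) -> str:
        self.calls.append(kwargs)
        return self.canned_output

def stub_agent(task: str, tools: dict[str, Callable[..., str]]) -> str:
    """Placeholder for the real agent loop (an assumption for this sketch)."""
    forecast = tools["get_weather"](city="Paris")
    return f"Forecast for Paris: {forecast}"

weather = MockTool("get_weather", "18C, light rain")
search = MockTool("web_search", "unused")
answer = stub_agent(
    "What's the weather in Paris?",
    {"get_weather": weather, "web_search": search},
)

# Behavioral assertions: right tool, right arguments, no redundant calls.
assert weather.calls == [{"city": "Paris"}]
assert search.calls == []
assert "light rain" in answer
```

Because the canned output is fixed, any change in the recorded calls comes from the agent's decisions rather than from a flaky dependency.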

Three Grading Methods

Use the lightest method that covers each case:

Method	Best for	Trade-off
Code-based	Exact match, regex, test suite pass/fail	Fastest and most reliable, but limited to verifiable outputs
LLM-as-judge	Open-ended outputs, style, completeness	Scalable and consistent with human judgment, but requires calibration
Human grading	Ambiguous edge cases, novel failure modes	Most flexible, but slowest — avoid when possible

LLM-as-Judge

For free-form outputs, a calibrated LLM judge with a structured rubric approximates human judgment. Define scoring dimensions explicitly:

RUBRIC = """Score the agent's response on each dimension (0.0-1.0):
- factual_accuracy: Are claims correct and supported?
- completeness: Does it address the full query?
- tool_efficiency: Were tools used appropriately (no redundant calls)?

Respond with JSON: {"scores": {...}, "pass": true/false, "explanation": "..."}"""

Track precision and recall of LLM graders against human assessments, and avoid class-imbalanced eval sets that distort headline accuracy. [Source: Demystifying Evals for AI Agents]
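
The calibration check can be sketched with plain Python. The labels below are made up to show how an imbalanced set flatters accuracy; human labels are treated as ground truth, and the function name is an illustrative choice.

```python
def precision_recall(judge: list[bool], human: list[bool]) -> tuple[float, float]:
    """Precision and recall of the judge, with True as the positive class."""
    pairs = list(zip(judge, human))
    tp = sum(j and h for j, h in pairs)
    fp = sum(j and not h for j, h in pairs)
    fn = sum(h and not j for j, h in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels for ten eval cases (True = pass). Eight of ten are passes,
# so the set is imbalanced.
human = [True, True, True, True, True, True, True, True, False, False]
judge = [True, True, True, True, True, True, True, True, True, False]

accuracy = sum(j == h for j, h in zip(judge, human)) / len(human)
pass_precision, pass_recall = precision_recall(judge, human)

# Flip the positive class to ask: how well does the judge catch real failures?
fail_precision, fail_recall = precision_recall(
    [not j for j in judge], [not h for h in human]
)

print(f"accuracy={accuracy:.2f}")
print(f"pass: precision={pass_precision:.2f} recall={pass_recall:.2f}")
print(f"fail: precision={fail_precision:.2f} recall={fail_recall:.2f}")
```

In this toy data the judge agrees with humans on nine of ten cases, yet it catches only one of the two real failures, which is the number that matters for a regression gate.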

Three-Part Eval Foundation

Every agent eval system needs three components working in a feedback loop:

graph LR
    A[Agent under test] -->|runs against| B[Representative dataset]
    B -->|outputs graded by| C[Scorer library]
    C -->|regressions fed back to| A

Representative dataset: Start with ~20 queries. A small sample is enough to detect large effects (e.g., a pass rate moving from 30% to 80% after a prompt change), so a large dataset is not needed upfront. [Source: Demystifying Evals for AI Agents]
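
A rough way to see why 20 queries can suffice for large effects is to put a confidence interval around each pass rate. The sketch below uses the Wilson score interval, which is a choice made here rather than something the cited source prescribes, and treats non-overlapping intervals as a conservative heuristic rather than a formal significance test.

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Large effect: 6/20 (30%) before a prompt change, 16/20 (80%) after.
before = wilson_interval(6, 20)
after = wilson_interval(16, 20)
large_effect_separated = before[1] < after[0]

# Small effect: 12/20 (60%) versus 14/20 (70%).
small_before = wilson_interval(12, 20)
small_after = wilson_interval(14, 20)
small_effect_separated = small_before[1] < small_after[0]

print(f"30% -> 80%: {before[0]:.2f}-{before[1]:.2f} vs {after[0]:.2f}-{after[1]:.2f}")
print(f"60% -> 70%: separated={small_effect_separated}")
```

With 20 queries the 30% and 80% intervals do not overlap, while the 60% and 70% intervals do, so a dataset this small resolves dramatic changes but not incremental ones.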

Scorer library: Reusable grading functions — code checkers, LLM rubric evaluators, composite scorers — each returning a structured result.
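
A scorer library along these lines might look like the following sketch. `ScoreResult`, the `Scorer` signature, and the individual checks are hypothetical names chosen for illustration; an LLM rubric evaluator could be wrapped to fit the same signature.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class ScoreResult:
    """Structured result that every scorer returns."""
    name: str
    passed: bool
    score: float
    explanation: str = ""

Scorer = Callable[[str], ScoreResult]

def regex_scorer(name: str, pattern: str) -> Scorer:
    """Code checker: passes when the output matches the pattern."""
    compiled = re.compile(pattern)

    def score(output: str) -> ScoreResult:
        hit = compiled.search(output) is not None
        return ScoreResult(name, hit, 1.0 if hit else 0.0)

    return score

def max_length_scorer(name: str, limit: int) -> Scorer:
    """Code checker: passes when the output stays within a character budget."""
    def score(output: str) -> ScoreResult:
        ok = len(output) <= limit
        return ScoreResult(name, ok, 1.0 if ok else 0.0, f"{len(output)} chars")

    return score

def composite(name: str, scorers: list[Scorer]) -> Scorer:
    """Composite scorer: passes only if every child passes; score is the mean."""
    def score(output: str) -> ScoreResult:
        results = [s(output) for s in scorers]
        failed = [r.name for r in results if not r.passed]
        mean = sum(r.score for r in results) / len(results)
        note = f"failed: {failed}" if failed else "all passed"
        return ScoreResult(name, not failed, mean, note)

    return score

summary_scorer = composite("summary_quality", [
    regex_scorer("cites_source", r"\[Source: .+?\]"),
    max_length_scorer("concise", 200),
])
good = summary_scorer("Evals should grade outcomes. [Source: internal eval guide]")
bad = summary_scorer("Evals should grade outcomes.")
print(good)
print(bad)
```

Returning the same structured type from every scorer is what lets results be aggregated, logged, and compared across runs without special cases.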

Feedback loop: Every model, prompt, or tool change runs through the same dataset and scorers, catching regressions before deployment.
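
Catching regressions means comparing results case by case, not only in aggregate. A minimal sketch, assuming each run is stored as a mapping from a hypothetical case ID to pass/fail:

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return case IDs that passed on the baseline but fail on the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )

def pass_rate(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results)

# Made-up results for the same four queries before and after a prompt change.
baseline = {"q1": True, "q2": True, "q3": False, "q4": True}
candidate = {"q1": True, "q2": False, "q3": True, "q4": True}

regressions = find_regressions(baseline, candidate)
print(pass_rate(baseline), pass_rate(candidate))  # 0.75 0.75
print(regressions)                                # ['q2']
```

Here the headline pass rate is unchanged, but one previously passing query now fails, which an aggregate-only comparison would miss.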

Define Acceptable Variance

Pass rate thresholds are not fixed at 100% — they depend on the failures you tolerate. This is a product decision.

File editing agent: 95% acceptable (formatting differences tolerable)
Security scanning agent: 99.5% minimum (missed vulnerabilities are not tolerable)
Research summarization agent: 85% acceptable (phrasing variance expected)

When pass rates drop below threshold, the eval suite blocks deployment. Revisit thresholds as capabilities evolve.
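
A deployment gate of this kind can be sketched in a few lines. The threshold values mirror the examples above; the dictionary keys, the `gate` function, and the sample outcomes are illustrative assumptions.

```python
# Threshold values mirror the examples in this section; the keys are hypothetical.
THRESHOLDS = {
    "file_editing": 0.95,
    "security_scanning": 0.995,
    "research_summarization": 0.85,
}

def gate(agent: str, outcomes: list[bool]) -> bool:
    """Return True when the observed pass rate meets the agent's threshold."""
    rate = sum(outcomes) / len(outcomes)
    threshold = THRESHOLDS[agent]
    allowed = rate >= threshold
    verdict = "deploy" if allowed else "block"
    print(f"{agent}: {rate:.1%} (threshold {threshold:.1%}) -> {verdict}")
    return allowed

# The same 97% pass rate over 100 runs, judged against two different thresholds.
outcomes = [True] * 97 + [False] * 3
file_editing_ok = gate("file_editing", outcomes)
security_ok = gate("security_scanning", outcomes)

# In CI, this value would be passed to sys.exit() so a failure blocks the deploy step.
exit_code = 0 if file_editing_ok and security_ok else 1
```

The identical pass rate clears the file-editing gate and fails the security gate, which is the sense in which the threshold is a product decision rather than a property of the agent.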

Evaluate End-State, Not Process

For agents that modify persistent state across turns, evaluate final outcomes. A longer path that reaches the correct state beats a shorter path that does not. See Grade Agent Outcomes, Not Execution Paths for implementation.

Cognition operationalizes this in Devin's test mode: rather than asserting on the execution path, the agent spins up the app in its own VM, clicks through it, and confirms the change works "the same way an engineer would," each run scaling out in parallel on its own dev server. [Source: Cognition: Verifying Agentic Development at Scale]

Example

A minimal behavioral eval that combines code-based and LLM-based grading for a coding agent:

import json
import subprocess

import anthropic

client = anthropic.Anthropic()

def grade_deterministic(repo_path: str, test_file: str) -> dict:
    """Code-based grading: does the test suite pass?"""
    result = subprocess.run(
        ["python", "-m", "pytest", test_file, "-q"],
        cwd=repo_path, capture_output=True, text=True,
    )
    return {"method": "code", "passed": result.returncode == 0}

def grade_behavioral(question: str, output: str, rubric: str) -> dict:
    """LLM-as-judge grading: does the output meet behavioral criteria?"""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": (
            f"Evaluate this agent output.\n\n"
            f"Task: {question}\nOutput: {output}\nRubric: {rubric}\n\n"
            f"Respond with JSON: {{\"score\": 0.0-1.0, \"pass\": true/false, "
            f"\"explanation\": \"...\"}}"
        )}],
    )
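    try:
        verdict = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        # An unparseable judge reply counts as a failed grade rather than a crash.
        verdict = {"pass": False, "explanation": "judge returned invalid JSON"}
    return {"method": "llm", **verdict}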

# Combine both methods for a complete behavioral eval
deterministic = grade_deterministic("./repo", "tests/test_feature.py")
behavioral = grade_behavioral(
    question="Refactor the user service to use dependency injection",
    output=open("./repo/services/user.py").read(),
    rubric="Uses constructor injection, no global state, testable in isolation",
)
overall_pass = deterministic["passed"] and behavioral.get("pass", False)
print(f"Deterministic: {deterministic['passed']} | Behavioral: {behavioral} | Overall: {overall_pass}")

Multi-Agent Considerations

Small prompt changes in one agent can unpredictably alter subagent behavior. [Source: Multi-Agent Research System] Monitor interaction patterns and establish golden trajectory baselines to catch regressions across decision points.
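
One possible way to compare a run against a golden trajectory is a sequence diff over (agent, action) steps. The sketch below uses Python's standard `difflib.SequenceMatcher`; the step format, agent names, and the idea of using a similarity ratio are assumptions made here, not something the cited source specifies. It is meant as a monitoring signal for drift, not as a pass/fail assertion on the exact path.

```python
from difflib import SequenceMatcher

Step = tuple[str, str]  # (agent name, action taken)

def trajectory_similarity(golden: list[Step], observed: list[Step]) -> float:
    """Ratio in [0, 1]; 1.0 means the two step sequences are identical."""
    return SequenceMatcher(None, golden, observed, autojunk=False).ratio()

def diverging_spans(golden: list[Step], observed: list[Step]) -> list[tuple]:
    """List the spans where the observed run departs from the golden one."""
    matcher = SequenceMatcher(None, golden, observed, autojunk=False)
    return [
        (tag, golden[i1:i2], observed[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

golden = [
    ("lead", "plan"),
    ("searcher", "web_search"),
    ("searcher", "summarize"),
    ("lead", "synthesize"),
]
# After a prompt change, the subagent searches twice before summarizing.
observed = [
    ("lead", "plan"),
    ("searcher", "web_search"),
    ("searcher", "web_search"),
    ("searcher", "summarize"),
    ("lead", "synthesize"),
]

similarity = trajectory_similarity(golden, observed)
drift = diverging_spans(golden, observed)
print(f"similarity={similarity:.2f}")
print(drift)
```

A falling similarity score or a recurring divergence at the same decision point is a prompt to inspect the run; whether the new path is worse still has to be judged on its end-state.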

When This Backfires

Behavioral testing pays off only when outputs are genuinely non-deterministic and the graders and thresholds are calibrated:

Constrained function-calling agents: Structured JSON with a fixed schema needs equality checks. LLM grading adds cost without signal.
High-volume regression suites: LLM-as-judge at thousands of cases per CI run is slow and expensive. Reserve it for the agentic layer and code-check structured outputs at scale. How much to spend on grading is itself a domain-specific decision: LangChain describes co-designing efficient verifiers with Harvey for legal agents, weighing verifier cost against domain economics. [Source: Designing Efficient Verifiers for Legal Agents]
Uncalibrated thresholds: Thresholds set without real failure data either block valid outputs or pass defective ones.
Uncalibrated LLM judge: An LLM grader not calibrated against human experts introduces systematic bias that invalidates the eval pipeline.
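
For the constrained function-calling case in the first item above, a code-based equality check is usually enough. A minimal sketch, with a hypothetical `create_ticket` call as the expected output:

```python
import json

def grade_tool_call(raw_output: str, expected: dict) -> bool:
    """Exact-match grading for a fixed-schema call.

    Comparing parsed objects means key order and whitespace do not matter.
    """
    try:
        actual = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return actual == expected

# Hypothetical expected call for one eval case.
expected = {
    "name": "create_ticket",
    "arguments": {"priority": "high", "assignee": "oncall"},
}

same_call = grade_tool_call(
    '{"arguments": {"assignee": "oncall", "priority": "high"}, "name": "create_ticket"}',
    expected,
)
wrong_call = grade_tool_call(
    '{"name": "create_ticket", "arguments": {"priority": "low", "assignee": "oncall"}}',
    expected,
)
malformed = grade_tool_call("create_ticket(priority=high)", expected)

print(same_call, wrong_call, malformed)  # True False False
```

This runs in microseconds per case and needs no calibration, which is why it is the better fit whenever the output space is this tightly constrained.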

Key Takeaways

Separate deterministic from agentic components using a capability matrix
Use code-based grading first, LLM-as-judge for open-ended outputs, human grading as a last resort
Start with ~20 representative queries — small samples catch large effect sizes
Define pass rate thresholds as a product decision, not an engineering target
Evaluate end-state and decision quality, not execution paths
In multi-agent systems, monitor cross-agent interaction patterns

Grade Agent Outcomes, Not Execution Paths
Golden Query Pairs as Continuous Regression Tests
Eval-Driven Development
LLM-as-Judge Evaluation
pass@k Metrics
CoT Robustness in Code Generation — A/B measurement of chain-of-thought as a concrete behavioral-test application
Anti-Reward-Hacking: Rubrics That Resist Gaming
Incident-to-Eval Synthesis: Production Failures as Evals