
Eval & Assertions

Metrics, structured assertions, datasets, and cost tracking.

Assertions

assert_tool_called

assert_tool_called(result, tool_name, *, call_index=None, **expected_args)

Assert a tool was called with specific argument patterns.

Parameters:

- result (AgentRun, required): The AgentRun to inspect.
- tool_name (str, required): Name of the tool that should have been called.
- call_index (int | None, default None): If given, check a specific call (0-based). Otherwise, check that at least one call matches all patterns.
- **expected_args (Any, default {}): Argument patterns to match. Values can be exact matches or dirty-equals matchers.

Returns:

- ToolCall: The matching ToolCall.

Raises:

- StructuredAssertionError: If no matching call is found.
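
For intuition, the matching rule can be sketched in plain Python: a recorded call matches when every expected argument compares equal to the actual argument (dirty-equals matchers also plug into `==`). The `calls` list and `first_matching_call` helper below are illustrative, not the library's real `AgentRun` internals.

```python
# Conceptual sketch of assert_tool_called's matching rule; the call
# records below are illustrative, not the library's real AgentRun type.
calls = [
    {"tool": "lookup_order", "args": {"order_id": "12345", "verbose": True}},
    {"tool": "initiate_refund", "args": {"order_id": "12345"}},
]

def first_matching_call(calls, tool_name, **expected_args):
    """Return the first call to tool_name whose args contain expected_args."""
    for call in calls:
        if call["tool"] != tool_name:
            continue
        # Every expected key must be present and compare equal; matcher
        # objects (e.g. dirty-equals) would also hook into == here.
        if all(call["args"].get(k) == v for k, v in expected_args.items()):
            return call
    raise AssertionError(f"no call to {tool_name!r} matched {expected_args!r}")

match = first_matching_call(calls, "lookup_order", order_id="12345")
print(match["tool"])  # lookup_order
```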

assert_output_schema

assert_output_schema(result, model, *, strict=False)

Validate that the agent's output parses into the given Pydantic model.

Parameters:

- result (AgentRun | Any, required): An AgentRun (uses final_output) or a raw value.
- model (type[T], required): The Pydantic model class to validate against.
- strict (bool, default False): If True, use Pydantic's strict mode (no coercion).

Returns:

- T: The validated Pydantic model instance.

Raises:

- StructuredAssertionError: If validation fails, with field-level detail.
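
Since validation targets a Pydantic model, the strict flag maps onto Pydantic v2's strict validation. A minimal sketch of the difference, assuming pydantic v2 is installed (the `Refund` model and payload are hypothetical):

```python
from pydantic import BaseModel, ValidationError

# Hypothetical output model for illustration.
class Refund(BaseModel):
    order_id: str
    amount: int

payload = {"order_id": "12345", "amount": "20"}

# Lax validation (strict=False) coerces the numeric string to an int.
print(Refund.model_validate(payload).amount)  # 20

# Strict validation (strict=True) rejects the same coercion.
try:
    Refund.model_validate(payload, strict=True)
except ValidationError as exc:
    print(exc.error_count())  # 1
```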

assert_output_matches

assert_output_matches(result, pattern)

Assert output matches a partial/fuzzy pattern.

Works with plain values for exact matching, or with dirty-equals matchers for flexible structural matching.

Parameters:

- result (AgentRun | Any, required): An AgentRun (uses final_output) or a raw value.
- pattern (dict[str, Any] | Any, required): A dict of field patterns (supports dirty-equals matchers), or any value for direct comparison.

Raises:

- StructuredAssertionError: If any field doesn't match the pattern.
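
The partial-matching semantics can be sketched with stdlib code: only the keys named in the pattern are checked, and extra keys in the output are ignored. The helper and data below are illustrative.

```python
# Sketch of partial/fuzzy matching: check only the keys the pattern
# names; keys present in the output but absent from the pattern pass.
def matches(output: dict, pattern: dict) -> bool:
    return all(output.get(k) == v for k, v in pattern.items())

output = {"status": "refunded", "amount": 20.0, "order_id": "12345"}

print(matches(output, {"status": "refunded"}))                  # True
print(matches(output, {"status": "refunded", "amount": 99.0}))  # False
```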

assert_json_schema

assert_json_schema(output, schema)

Validate output against a JSON Schema.

Parameters:

- output (Any, required): The value to validate (dict, list, or JSON string).
- schema (dict[str, Any], required): A JSON Schema dict.

Raises:

- StructuredAssertionError: If validation fails.
- ImportError: If jsonschema is not installed.
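
A schema like the following is the kind of input the function expects. The stdlib sketch below checks only the `type` and `required` keywords for illustration; the real jsonschema package implements the full specification.

```python
import json

schema = {
    "type": "object",
    "required": ["order_id", "amount"],
    "properties": {"order_id": {"type": "string"}, "amount": {"type": "number"}},
}

# Minimal sketch covering just the two keywords used above; JSON strings
# are parsed first, mirroring the documented accepted input types.
def check(output, schema):
    data = json.loads(output) if isinstance(output, str) else output
    if schema.get("type") == "object" and not isinstance(data, dict):
        return False
    return all(key in data for key in schema.get("required", []))

print(check('{"order_id": "12345", "amount": 20}', schema))  # True
print(check({"order_id": "12345"}, schema))                  # False
```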

StructuredAssertionError

StructuredAssertionError(message, *, details=None)

Bases: AssertionError

Rich assertion error with structured diff information.

Metrics

Built-in evaluation metrics for agent runs.

Provides deterministic metrics for evaluating agent performance:

- TaskCompletion: Did the agent achieve the stated goal?
- ToolCorrectness: Were the right tools called? (precision, recall, F1)
- StepEfficiency: How many steps vs. optimal? (ratio)
- TrajectoryMatch: Does the step sequence match expected?

Requirements: F3.1

task_completion(run, *, expected_output_contains=None, expected_output_equals=None, check_no_error=True, threshold=1.0)

Score task completion based on output content and success.

Computes a score from 0.0 to 1.0 based on:

- Whether the run completed without errors (if check_no_error=True)
- Whether the output contains expected substrings
- Whether the output exactly matches the expected value

Parameters:

- run (AgentRun, required): The agent run to evaluate.
- expected_output_contains (list[str] | None, default None): Substrings that must appear in the output.
- expected_output_equals (str | None, default None): Exact expected output string.
- check_no_error (bool, default True): Whether to check that the run had no errors.
- threshold (float, default 1.0): Score threshold for pass/fail.

Returns:

- Score: Score with value between 0.0 and 1.0.
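
The score can be read as the fraction of enabled checks that pass. A sketch under that assumption (the library's exact weighting may differ):

```python
def task_completion_score(output, *, failed, contains=None, equals=None,
                          check_no_error=True):
    """Fraction of enabled checks that pass (assumed weighting)."""
    checks = []
    if check_no_error:
        checks.append(not failed)          # run finished without errors
    for needle in contains or []:
        checks.append(needle in output)    # each required substring
    if equals is not None:
        checks.append(output == equals)    # exact-match check
    return sum(checks) / len(checks) if checks else 1.0

score = task_completion_score(
    "Refund initiated; expect 3-5 business days.",
    failed=False,
    contains=["Refund initiated", "business days"],
)
print(score)  # 1.0
```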

tool_correctness(run, *, expected_tools, threshold=0.5)

Score tool usage with precision, recall, and F1.

Compares the set of tools actually called against the expected set.

Parameters:

- run (AgentRun, required): The agent run to evaluate.
- expected_tools (list[str], required): List of tool names that should have been called.
- threshold (float, default 0.5): F1 score threshold for pass/fail.

Returns:

- Score: Score with F1 value and precision/recall in metadata.
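
The precision/recall/F1 computation is standard set-based scoring over tool names. A self-contained sketch (duplicate calls collapse into the set; names are illustrative):

```python
def tool_f1(called, expected):
    """Set-based precision/recall/F1 over tool names (duplicates ignored)."""
    called_set, expected_set = set(called), set(expected)
    tp = len(called_set & expected_set)  # tools both called and expected
    precision = tp / len(called_set) if called_set else 0.0
    recall = tp / len(expected_set) if expected_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# tp=2 of 3 called and 3 expected -> precision = recall = f1 = 2/3
scores = tool_f1(
    called=["lookup_order", "send_email", "initiate_refund"],
    expected=["lookup_order", "check_return_policy", "initiate_refund"],
)
print(scores)
```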

step_efficiency(run, *, optimal_steps, threshold=0.5)

Score step efficiency as ratio of optimal to actual steps.

A score of 1.0 means the agent used the optimal number of steps. Scores decrease as the agent uses more steps than optimal.

Parameters:

- run (AgentRun, required): The agent run to evaluate.
- optimal_steps (int, required): The minimum number of steps needed.
- threshold (float, default 0.5): Efficiency ratio threshold for pass/fail.

Returns:

- Score: Score with efficiency ratio.
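
The ratio itself is just optimal over actual steps. The sketch below caps it at 1.0, which is an assumption about how runs that finish in fewer than optimal steps are handled:

```python
def step_efficiency_ratio(actual_steps, optimal_steps):
    """optimal/actual, capped at 1.0 (the cap is an assumption)."""
    if actual_steps <= 0:
        return 0.0
    return min(1.0, optimal_steps / actual_steps)

print(step_efficiency_ratio(8, 4))  # 0.5
print(step_efficiency_ratio(4, 4))  # 1.0
```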

trajectory_match(run, *, expected_trajectory, mode='ordered', threshold=1.0)

Score whether the agent's tool call sequence matches expected.

Supports three matching modes:

- "strict": Exact sequence match (same tools in same order, same count)
- "ordered": Expected tools appear in order (extra tools may appear between them)
- "unordered": All expected tools were called (any order)

Parameters:

- run (AgentRun, required): The agent run to evaluate.
- expected_trajectory (list[str], required): Ordered list of expected tool names.
- mode (str, default 'ordered'): Matching mode: "strict", "ordered", or "unordered".
- threshold (float, default 1.0): Score threshold for pass/fail.

Returns:

- Score: Score with match value.

Raises:

- ValueError: If mode is not one of "strict", "ordered", "unordered".
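
The three modes can be sketched with stdlib code; the `ordered` branch uses the classic iterator-based subsequence check (names and data are illustrative):

```python
def trajectory_match_score(actual, expected, mode="ordered"):
    if mode == "strict":
        # Same tools, same order, same count.
        return 1.0 if actual == expected else 0.0
    if mode == "ordered":
        # Subsequence check: each `in` consumes the iterator up to its
        # match, so expected tools must appear in order (extras allowed).
        it = iter(actual)
        return 1.0 if all(tool in it for tool in expected) else 0.0
    if mode == "unordered":
        # Every expected tool called at least once, any order.
        return 1.0 if set(expected) <= set(actual) else 0.0
    raise ValueError(f"mode must be 'strict', 'ordered', or 'unordered', got {mode!r}")

actual = ["lookup_order", "log_event", "check_return_policy", "initiate_refund"]
expected = ["lookup_order", "check_return_policy", "initiate_refund"]

print(trajectory_match_score(actual, expected, "strict"))     # 0.0
print(trajectory_match_score(actual, expected, "ordered"))    # 1.0
print(trajectory_match_score(actual, expected, "unordered"))  # 1.0
```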

Datasets

GoldenDataset

GoldenDataset

Bases: BaseModel

A collection of test cases forming a golden dataset.

Includes optional metadata about the dataset itself (name, version, etc.).

filter_by_tags(*tags)

Return cases matching any of the given tags.

get_case(case_id)

Look up a single test case by ID.

EvalCase

EvalCase

Bases: BaseModel

A single test case in a golden dataset.

Example YAML/JSON:

    id: refund-001
    input: "I want to return order #12345"
    expected_tools: [lookup_order, check_return_policy, initiate_refund]
    expected_output_contains: ["refund initiated", "3-5 business days"]
    max_steps: 8
    tags: [refund, happy-path]

load_dataset

load_dataset(path)

Load and validate a golden dataset from a file.

Parameters:

- path (str | Path, required): Path to a JSON or YAML file containing test cases.

Returns:

- GoldenDataset: A validated GoldenDataset instance.

Raises:

- FileNotFoundError: If the file does not exist.
- ValueError: If the file format is unsupported or data is invalid.
- ValidationError: If test cases fail schema validation.

load_cases

load_cases(path, tags=None)

Load test cases from a file, optionally filtered by tags.

Convenience function that returns just the list of EvalCase objects.

Parameters:

- path (str | Path, required): Path to a JSON or YAML file.
- tags (list[str] | None, default None): If provided, only return cases matching any of these tags.

Returns:

- list[EvalCase]: List of validated EvalCase instances.
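
End to end, a golden dataset is just structured JSON/YAML on disk. The sketch below writes a small JSON file and reimplements the any-tag filter with stdlib code; the top-level `{"name": ..., "cases": [...]}` shape is an assumption, and only the per-case fields mirror the documented EvalCase example.

```python
import json
import pathlib
import tempfile

# Hypothetical dataset file; the top-level shape is assumed, the per-case
# fields mirror the documented EvalCase example.
dataset = {
    "name": "refunds",
    "cases": [
        {"id": "refund-001", "input": "return order #12345",
         "tags": ["refund", "happy-path"]},
        {"id": "refund-002", "input": "item arrived broken",
         "tags": ["refund", "edge-case"]},
    ],
}
path = pathlib.Path(tempfile.mkdtemp()) / "golden.json"
path.write_text(json.dumps(dataset))

cases = json.loads(path.read_text())["cases"]

def filter_by_tags(cases, *tags):
    # A case matches if it carries ANY of the requested tags.
    return [c for c in cases if set(c.get("tags", [])) & set(tags)]

print([c["id"] for c in filter_by_tags(cases, "edge-case")])  # ['refund-002']
```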

Cost Tracking

CostTracker

CostTracker(budget=None, pricing_overrides=None, default_pricing=None)

Tracks cumulative costs across multiple agent runs with budget enforcement.

Usage:

    tracker = CostTracker(budget=config.budget, pricing_overrides=...)

    # After each test run:
    breakdown = tracker.record(run)

    # Check budget:
    tracker.check_budget()  # raises BudgetExceededError if over

record(run)

Calculate cost for a run and add it to the cumulative tracker.

check_test_budget(breakdown)

Check if a single test's cost exceeds the per-test budget.

Raises BudgetExceededError if over limit.

check_suite_budget()

Check if cumulative cost exceeds the per-suite budget.

Raises BudgetExceededError if over limit.

check_ci_budget()

Check if cumulative cost exceeds the per-CI-run budget.

Raises BudgetExceededError if over limit.

summary()

Generate a summary report of all tracked runs.

CostReport

CostReport(run_count=0, total_input_tokens=0, total_output_tokens=0, total_cost=0.0, per_model=dict(), unpriced_steps=0, budget=BudgetConfig()) dataclass

Summary cost report across multiple runs.

budget_utilization()

Return budget utilization as fractions (0.0-1.0+) for each limit.

CostBreakdown

CostBreakdown(total_input_tokens=0, total_output_tokens=0, total_cost=0.0, per_model=dict(), unpriced_steps=0) dataclass

Full cost breakdown for a single agent run.

calculate_run_cost

calculate_run_cost(run, pricing_overrides=None, default_pricing=None)

Calculate cost breakdown for an entire agent run.

Pricing resolution order per step:

1. Step's model in pricing_overrides
2. Step's model in BUILTIN_PRICING
3. default_pricing (fallback)

If no pricing is found for a step, it is counted as unpriced.
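
The resolution order can be sketched as three dictionary lookups; the pricing table and its values below are illustrative, not the library's actual BUILTIN_PRICING:

```python
# Illustrative pricing table (USD per 1M tokens); not the real values.
BUILTIN_PRICING = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def resolve_pricing(model, pricing_overrides=None, default_pricing=None):
    """Apply the documented order; returning None marks the step unpriced."""
    overrides = pricing_overrides or {}
    if model in overrides:          # 1. per-run override wins
        return overrides[model]
    if model in BUILTIN_PRICING:    # 2. built-in table
        return BUILTIN_PRICING[model]
    return default_pricing          # 3. fallback (may be None)

print(resolve_pricing("gpt-4o", {"gpt-4o": {"input": 1.0, "output": 2.0}}))
print(resolve_pricing("my-local-model"))  # None -> counted as unpriced
```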

BudgetExceededError

BudgetExceededError(limit_name, limit_usd, actual_usd)

Bases: Exception

Raised when a cost budget limit is exceeded.