Eval & Assertions¶
Metrics, structured assertions, datasets, and cost tracking.
Assertions¶
assert_tool_called¶
assert_tool_called(result, tool_name, *, call_index=None, **expected_args)

Assert a tool was called with specific argument patterns.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| result | AgentRun | The AgentRun to inspect. | required |
| tool_name | str | Name of the tool that should have been called. | required |
| call_index | int \| None | If given, check a specific call (0-based). Otherwise checks that at least one call matches all patterns. | None |
| **expected_args | Any | Argument patterns to match. Values can be exact matches or dirty-equals matchers. | {} |

Returns:

| Type | Description |
|---|---|
| ToolCall | The matching ToolCall. |

Raises:

| Type | Description |
|---|---|
| StructuredAssertionError | If no matching call is found. |
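A minimal sketch of the matching behavior, using plain dicts as hypothetical stand-ins for the AgentRun's tool calls (the real function inspects an AgentRun and returns a ToolCall):

```python
def find_tool_call(calls, tool_name, **expected_args):
    """Return the first call to `tool_name` whose arguments match every
    expected pattern; raise if none does. Matcher objects (e.g. from
    dirty-equals) work here because they customize __eq__."""
    for call in calls:
        if call["name"] != tool_name:
            continue
        if all(call["args"].get(k) == v for k, v in expected_args.items()):
            return call
    raise AssertionError(f"no call to {tool_name!r} matched {expected_args!r}")

calls = [
    {"name": "lookup_order", "args": {"order_id": "12345"}},
    {"name": "initiate_refund", "args": {"order_id": "12345", "amount": 20.0}},
]
match = find_tool_call(calls, "initiate_refund", order_id="12345")
```

With `call_index` given, the real assertion checks only that one call instead of scanning for any match.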
assert_output_schema¶
assert_output_schema(result, model, *, strict=False)

Validate that the agent's output parses into the given Pydantic model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| result | AgentRun \| Any | An AgentRun (uses final_output) or a raw value. | required |
| model | type[T] | The Pydantic model class to validate against. | required |
| strict | bool | If True, use Pydantic's strict mode (no coercion). | False |

Returns:

| Type | Description |
|---|---|
| T | The validated Pydantic model instance. |

Raises:

| Type | Description |
|---|---|
| StructuredAssertionError | If validation fails, with field-level detail. |
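The strict/lax distinction is Pydantic's own; a sketch of what `strict=True` changes, assuming Pydantic v2 is installed (the model here is illustrative):

```python
from pydantic import BaseModel, ValidationError

class Refund(BaseModel):
    order_id: str
    amount: float

# Default (lax) mode coerces the numeric string into a float.
refund = Refund.model_validate({"order_id": "12345", "amount": "20.0"})

# strict=True disables coercion, so the same payload now fails validation.
try:
    Refund.model_validate({"order_id": "12345", "amount": "20.0"}, strict=True)
    strict_failed = False
except ValidationError:
    strict_failed = True
```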
assert_output_matches¶
assert_output_matches(result, pattern)

Assert output matches a partial/fuzzy pattern.

Works with plain values for exact matching, or with dirty-equals matchers for flexible structural matching.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| result | AgentRun \| Any | An AgentRun (uses final_output) or a raw value. | required |
| pattern | dict[str, Any] \| Any | A dict of field patterns (supports dirty-equals matchers), or any value for direct comparison. | required |

Raises:

| Type | Description |
|---|---|
| StructuredAssertionError | If any field doesn't match the pattern. |
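dirty-equals matchers work by customizing `__eq__`, so the dict-pattern mode reduces to per-field equality checks. A hypothetical pure-Python sketch of that mode:

```python
def mismatched_fields(output: dict, pattern: dict) -> list[str]:
    """Return the pattern fields that fail to match; an empty list means
    success. Matcher objects pass through == just like plain values."""
    return [
        key for key, expected in pattern.items()
        if output.get(key) != expected
    ]

output = {"status": "refunded", "amount": 20.0, "note": "3-5 business days"}
# Partial match: fields not named in the pattern are ignored.
failures = mismatched_fields(output, {"status": "refunded", "amount": 20.0})
```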
assert_json_schema¶
assert_json_schema(output, schema)

Validate output against a JSON Schema.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output | Any | The value to validate (dict, list, or JSON string). | required |
| schema | dict[str, Any] | A JSON Schema dict. | required |

Raises:

| Type | Description |
|---|---|
| StructuredAssertionError | If validation fails. |
| ImportError | If jsonschema is not installed. |
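A sketch of the underlying check, assuming the jsonschema package is installed (the schema is illustrative); JSON strings are parsed before validating:

```python
import json
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
    },
    "required": ["order_id", "amount"],
}

# A JSON string is parsed first; dicts and lists validate directly.
output = json.loads('{"order_id": "12345", "amount": 20.0}')
jsonschema.validate(output, schema)  # silent on success

try:
    jsonschema.validate({"order_id": 12345}, schema)  # wrong type, missing field
    invalid_passed = True
except jsonschema.ValidationError:
    invalid_passed = False
```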
StructuredAssertionError¶
StructuredAssertionError(message, *, details=None)
Bases: AssertionError
Rich assertion error with structured diff information.
Metrics¶
Built-in evaluation metrics for agent runs.
Provides deterministic metrics for evaluating agent performance:

- TaskCompletion: Did the agent achieve the stated goal?
- ToolCorrectness: Were the right tools called? (precision, recall, F1)
- StepEfficiency: How many steps vs. optimal? (ratio)
- TrajectoryMatch: Does the step sequence match expected?
Requirements: F3.1
task_completion(run, *, expected_output_contains=None, expected_output_equals=None, check_no_error=True, threshold=1.0)

Score task completion based on output content and success.

Computes a score from 0.0 to 1.0 based on:

- Whether the run completed without errors (if check_no_error=True)
- Whether the output contains expected substrings
- Whether the output exactly matches the expected value

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run | AgentRun | The agent run to evaluate. | required |
| expected_output_contains | list[str] \| None | Substrings that must appear in the output. | None |
| expected_output_equals | str \| None | Exact expected output string. | None |
| check_no_error | bool | Whether to check that the run had no errors. | True |
| threshold | float | Score threshold for pass/fail (default 1.0). | 1.0 |

Returns:

| Type | Description |
|---|---|
| Score | Score with value between 0.0 and 1.0. |
tool_correctness(run, *, expected_tools, threshold=0.5)

Score tool usage with precision, recall, and F1.

Compares the set of tools actually called against the expected set.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run | AgentRun | The agent run to evaluate. | required |
| expected_tools | list[str] | List of tool names that should have been called. | required |
| threshold | float | F1 score threshold for pass/fail (default 0.5). | 0.5 |

Returns:

| Type | Description |
|---|---|
| Score | Score with F1 value and precision/recall in metadata. |
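The set-based comparison this describes can be sketched in a few lines (tool names are illustrative):

```python
def tool_f1(actual: list[str], expected: list[str]) -> dict[str, float]:
    """Set-based precision/recall/F1 over tool names."""
    actual_set, expected_set = set(actual), set(expected)
    hits = len(actual_set & expected_set)
    precision = hits / len(actual_set) if actual_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Two of three expected tools called, plus one unexpected tool:
scores = tool_f1(
    actual=["lookup_order", "search_web", "initiate_refund"],
    expected=["lookup_order", "check_return_policy", "initiate_refund"],
)
```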
step_efficiency(run, *, optimal_steps, threshold=0.5)

Score step efficiency as the ratio of optimal to actual steps.

A score of 1.0 means the agent used the optimal number of steps. Scores decrease as the agent uses more steps than optimal.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run | AgentRun | The agent run to evaluate. | required |
| optimal_steps | int | The minimum number of steps needed. | required |
| threshold | float | Efficiency ratio threshold for pass/fail (default 0.5). | 0.5 |

Returns:

| Type | Description |
|---|---|
| Score | Score with efficiency ratio. |
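The ratio can be sketched as follows; capping at 1.0 when the agent beats the stated optimum is an assumption, not documented behavior:

```python
def efficiency(actual_steps: int, optimal_steps: int) -> float:
    """optimal/actual ratio, capped at 1.0 (capping is an assumption)."""
    if actual_steps <= 0:
        return 0.0
    return min(1.0, optimal_steps / actual_steps)
```

So an agent that takes twice the optimal number of steps scores 0.5, exactly the default threshold.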
trajectory_match(run, *, expected_trajectory, mode='ordered', threshold=1.0)

Score whether the agent's tool call sequence matches expected.

Supports three matching modes:

- "strict": Exact sequence match (same tools in same order, same count)
- "ordered": Expected tools appear in order (allows extra tools between)
- "unordered": All expected tools were called (any order)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run | AgentRun | The agent run to evaluate. | required |
| expected_trajectory | list[str] | Ordered list of expected tool names. | required |
| mode | str | Matching mode: "strict", "ordered", or "unordered". | 'ordered' |
| threshold | float | Score threshold for pass/fail (default 1.0). | 1.0 |

Returns:

| Type | Description |
|---|---|
| Score | Score with match value. |

Raises:

| Type | Description |
|---|---|
| ValueError | If mode is not one of "strict", "ordered", "unordered". |
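The default "ordered" mode is a subsequence check; a pure-Python sketch (tool names are illustrative):

```python
def ordered_match(actual: list[str], expected: list[str]) -> bool:
    """True if `expected` appears as a subsequence of `actual`:
    same relative order, extra tools in between are allowed."""
    it = iter(actual)
    # `tool in it` advances the iterator, so order is enforced.
    return all(tool in it for tool in expected)

actual = ["lookup_order", "search_web", "check_return_policy", "initiate_refund"]
in_order = ordered_match(actual, ["lookup_order", "initiate_refund"])
out_of_order = ordered_match(actual, ["initiate_refund", "lookup_order"])
```

"strict" would instead compare the lists for equality, and "unordered" would compare them as sets.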
Datasets¶
GoldenDataset¶
GoldenDataset

EvalCase¶
EvalCase

Bases: BaseModel

A single test case in a golden dataset.

Example (YAML):

```yaml
id: refund-001
input: "I want to return order #12345"
expected_tools: [lookup_order, check_return_policy, initiate_refund]
expected_output_contains: ["refund initiated", "3-5 business days"]
max_steps: 8
tags: [refund, happy-path]
```
load_dataset¶
load_dataset(path)

Load and validate a golden dataset from a file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to a JSON or YAML file containing test cases. | required |

Returns:

| Type | Description |
|---|---|
| GoldenDataset | A validated GoldenDataset instance. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If the file does not exist. |
| ValueError | If the file format is unsupported or data is invalid. |
| ValidationError | If test cases fail schema validation. |
load_cases¶
load_cases(path, tags=None)

Load test cases from a file, optionally filtered by tags.

Convenience function that returns just the list of EvalCase objects.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to a JSON or YAML file. | required |
| tags | list[str] \| None | If provided, only return cases matching any of these tags. | None |

Returns:

| Type | Description |
|---|---|
| list[EvalCase] | List of validated EvalCase instances. |
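A self-contained sketch of the dataset format and the any-tag filter, using JSON (stdlib only) and plain dicts as hypothetical stand-ins for EvalCase:

```python
import json
import pathlib
import tempfile

cases = [
    {"id": "refund-001", "input": "I want to return order #12345",
     "tags": ["refund", "happy-path"]},
    {"id": "search-001", "input": "Find my last order",
     "tags": ["search"]},
]
path = pathlib.Path(tempfile.mkdtemp()) / "golden.json"
path.write_text(json.dumps(cases))

def load_filtered(path, tags=None):
    """Load cases, keeping those that share at least one tag (any-match)."""
    loaded = json.loads(pathlib.Path(path).read_text())
    if tags is None:
        return loaded
    return [c for c in loaded if set(c.get("tags", [])) & set(tags)]

refund_cases = load_filtered(path, tags=["refund"])
```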
Cost Tracking¶
CostTracker¶
CostTracker(budget=None, pricing_overrides=None, default_pricing=None)

Tracks cumulative costs across multiple agent runs with budget enforcement.

Usage:

```python
tracker = CostTracker(budget=config.budget, pricing_overrides=...)

# After each test run:
breakdown = tracker.record(run)

# Check budget:
tracker.check_budget()  # raises BudgetExceededError if over
```

record(run)

Calculate cost for a run and add it to the cumulative tracker.

check_test_budget(breakdown)

Check if a single test's cost exceeds the per-test budget. Raises BudgetExceededError if over the limit.

check_suite_budget()

Check if cumulative cost exceeds the per-suite budget. Raises BudgetExceededError if over the limit.

check_ci_budget()

Check if cumulative cost exceeds the per-CI-run budget. Raises BudgetExceededError if over the limit.

summary()

Generate a summary report of all tracked runs.
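A hypothetical pure-Python sketch of the accumulate-then-check pattern described above; the class, field names, and dollar amounts are illustrative, not the real API:

```python
class MiniTracker:
    """Toy cost tracker: accumulate per-run costs, enforce one budget."""

    def __init__(self, suite_budget_usd: float):
        self.suite_budget_usd = suite_budget_usd
        self.total_cost = 0.0
        self.run_count = 0

    def record(self, cost_usd: float) -> float:
        self.total_cost += cost_usd
        self.run_count += 1
        return cost_usd

    def check_suite_budget(self) -> None:
        if self.total_cost > self.suite_budget_usd:
            raise RuntimeError(
                f"suite budget ${self.suite_budget_usd:.2f} exceeded: "
                f"${self.total_cost:.4f}")

tracker = MiniTracker(suite_budget_usd=0.05)
tracker.record(0.03)
tracker.check_suite_budget()      # under budget: no error
tracker.record(0.03)
try:
    tracker.check_suite_budget()  # now over budget
    exceeded = False
except RuntimeError:
    exceeded = True
```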
CostReport¶
CostReport(run_count=0, total_input_tokens=0, total_output_tokens=0, total_cost=0.0, per_model=dict(), unpriced_steps=0, budget=BudgetConfig())

dataclass

Summary cost report across multiple runs.

budget_utilization()

Return budget utilization as fractions (0.0-1.0+) for each limit.
CostBreakdown¶
CostBreakdown(total_input_tokens=0, total_output_tokens=0, total_cost=0.0, per_model=dict(), unpriced_steps=0)

dataclass

Full cost breakdown for a single agent run.

calculate_run_cost¶
calculate_run_cost(run, pricing_overrides=None, default_pricing=None)

Calculate cost breakdown for an entire agent run.

Pricing resolution order per step:

1. Step's model in pricing_overrides
2. Step's model in BUILTIN_PRICING
3. default_pricing (fallback)

If no pricing is found for a step, it is counted as unpriced.
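The resolution order can be sketched directly; the table contents and prices below are illustrative, not the library's actual BUILTIN_PRICING:

```python
# Illustrative built-in table: USD per 1M tokens (not the real values).
BUILTIN_PRICING = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def resolve_pricing(model, overrides=None, default=None):
    """Mirror the documented order: overrides, then built-ins, then the
    default. Returning None means the step is counted as unpriced."""
    if overrides and model in overrides:
        return overrides[model]
    if model in BUILTIN_PRICING:
        return BUILTIN_PRICING[model]
    return default
```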
BudgetExceededError¶
BudgetExceededError(limit_name, limit_usd, actual_usd)
Bases: Exception
Raised when a cost budget limit is exceeded.