Eval & Assertions¶
Metrics, structured assertions, datasets, and cost tracking.
Assertions¶
assert_tool_called¶
assert_tool_called(result, tool_name, *, call_index=None, **expected_args)

Assert a tool was called with specific argument patterns.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| result | AgentRun | The AgentRun to inspect. | required |
| tool_name | str | Name of the tool that should have been called. | required |
| call_index | int \| None | If given, check a specific call (0-based). Otherwise checks that at least one call matches all patterns. | None |
| **expected_args | Any | Argument patterns to match. Values can be exact matches or dirty-equals matchers. | {} |

Returns:

| Type | Description |
|---|---|
| ToolCall | The matching ToolCall. |

Raises:

| Type | Description |
|---|---|
| StructuredAssertionError | If no matching call is found. |
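A minimal sketch of the matching behavior, using plain dicts as hypothetical stand-ins for the AgentRun's tool calls (the real function inspects an AgentRun and returns a ToolCall):

```python
def find_tool_call(calls, tool_name, **expected_args):
    """Return the first call to `tool_name` whose arguments match every
    expected pattern; raise if none does. Matcher objects (e.g. from
    dirty-equals) work here because they customize __eq__."""
    for call in calls:
        if call["name"] != tool_name:
            continue
        if all(call["args"].get(k) == v for k, v in expected_args.items()):
            return call
    raise AssertionError(f"no call to {tool_name!r} matched {expected_args!r}")

calls = [
    {"name": "lookup_order", "args": {"order_id": "12345"}},
    {"name": "initiate_refund", "args": {"order_id": "12345", "amount": 20.0}},
]
match = find_tool_call(calls, "initiate_refund", order_id="12345")
```

With `call_index` given, the real assertion checks only that one call instead of scanning for any match.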
assert_output_schema¶
assert_output_schema(result, model, *, strict=False)

Validate that the agent's output parses into the given Pydantic model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| result | AgentRun \| Any | An AgentRun (uses final_output) or a raw value. | required |
| model | type[T] | The Pydantic model class to validate against. | required |
| strict | bool | If True, use Pydantic's strict mode (no coercion). | False |

Returns:

| Type | Description |
|---|---|
| T | The validated Pydantic model instance. |

Raises:

| Type | Description |
|---|---|
| StructuredAssertionError | If validation fails, with field-level detail. |
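The strict/lax distinction is Pydantic's own; a sketch of what `strict=True` changes, assuming Pydantic v2 is installed (the model here is illustrative):

```python
from pydantic import BaseModel, ValidationError

class Refund(BaseModel):
    order_id: str
    amount: float

# Default (lax) mode coerces the numeric string into a float.
refund = Refund.model_validate({"order_id": "12345", "amount": "20.0"})

# strict=True disables coercion, so the same payload now fails validation.
try:
    Refund.model_validate({"order_id": "12345", "amount": "20.0"}, strict=True)
    strict_failed = False
except ValidationError:
    strict_failed = True
```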
assert_output_matches¶
assert_output_matches(result, pattern)

Assert output matches a partial/fuzzy pattern.

Works with plain values for exact matching, or with dirty-equals matchers for flexible structural matching.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| result | AgentRun \| Any | An AgentRun (uses final_output) or a raw value. | required |
| pattern | dict[str, Any] \| Any | A dict of field patterns (supports dirty-equals matchers), or any value for direct comparison. | required |

Raises:

| Type | Description |
|---|---|
| StructuredAssertionError | If any field doesn't match the pattern. |
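dirty-equals matchers work by customizing `__eq__`, so the dict-pattern mode reduces to per-field equality checks. A hypothetical pure-Python sketch of that mode:

```python
def mismatched_fields(output: dict, pattern: dict) -> list[str]:
    """Return the pattern fields that fail to match; an empty list means
    success. Matcher objects pass through == just like plain values."""
    return [
        key for key, expected in pattern.items()
        if output.get(key) != expected
    ]

output = {"status": "refunded", "amount": 20.0, "note": "3-5 business days"}
# Partial match: fields not named in the pattern are ignored.
failures = mismatched_fields(output, {"status": "refunded", "amount": 20.0})
```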
assert_json_schema¶
assert_json_schema(output, schema)

Validate output against a JSON Schema.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| output | Any | The value to validate (dict, list, or JSON string). | required |
| schema | dict[str, Any] | A JSON Schema dict. | required |

Raises:

| Type | Description |
|---|---|
| StructuredAssertionError | If validation fails. |
| ImportError | If jsonschema is not installed. |
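A sketch of the underlying check, assuming the jsonschema package is installed (the schema is illustrative); JSON strings are parsed before validating:

```python
import json
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
    },
    "required": ["order_id", "amount"],
}

# A JSON string is parsed first; dicts and lists validate directly.
output = json.loads('{"order_id": "12345", "amount": 20.0}')
jsonschema.validate(output, schema)  # silent on success

try:
    jsonschema.validate({"order_id": 12345}, schema)  # wrong type, missing field
    invalid_passed = True
except jsonschema.ValidationError:
    invalid_passed = False
```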
StructuredAssertionError¶
StructuredAssertionError(message, *, details=None)
Bases: AssertionError
Rich assertion error with structured diff information.
Metrics¶
Built-in evaluation metrics for agent runs.
Provides deterministic metrics for evaluating agent performance:

- TaskCompletion: Did the agent achieve the stated goal?
- ToolCorrectness: Were the right tools called? (precision, recall, F1)
- StepEfficiency: How many steps vs. optimal? (ratio)
- TrajectoryMatch: Does the step sequence match expected?
Requirements: F3.1
task_completion(run, *, expected_output_contains=None, expected_output_equals=None, check_no_error=True, threshold=1.0)

Score task completion based on output content and success.

Computes a score from 0.0 to 1.0 based on:

- Whether the run completed without errors (if check_no_error=True)
- Whether the output contains expected substrings
- Whether the output exactly matches the expected value

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run | AgentRun | The agent run to evaluate. | required |
| expected_output_contains | list[str] \| None | Substrings that must appear in the output. | None |
| expected_output_equals | str \| None | Exact expected output string. | None |
| check_no_error | bool | Whether to check that the run had no errors. | True |
| threshold | float | Score threshold for pass/fail (default 1.0). | 1.0 |

Returns:

| Type | Description |
|---|---|
| Score | Score with value between 0.0 and 1.0. |
tool_correctness(run, *, expected_tools, threshold=0.5)

Score tool usage with precision, recall, and F1.

Compares the set of tools actually called against the expected set.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run | AgentRun | The agent run to evaluate. | required |
| expected_tools | list[str] | List of tool names that should have been called. | required |
| threshold | float | F1 score threshold for pass/fail (default 0.5). | 0.5 |

Returns:

| Type | Description |
|---|---|
| Score | Score with F1 value and precision/recall in metadata. |
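The set-based comparison this describes can be sketched in a few lines (tool names are illustrative):

```python
def tool_f1(actual: list[str], expected: list[str]) -> dict[str, float]:
    """Set-based precision/recall/F1 over tool names."""
    actual_set, expected_set = set(actual), set(expected)
    hits = len(actual_set & expected_set)
    precision = hits / len(actual_set) if actual_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Two of three expected tools called, plus one unexpected tool:
scores = tool_f1(
    actual=["lookup_order", "search_web", "initiate_refund"],
    expected=["lookup_order", "check_return_policy", "initiate_refund"],
)
```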
step_efficiency(run, *, optimal_steps, threshold=0.5)

Score step efficiency as the ratio of optimal to actual steps.

A score of 1.0 means the agent used the optimal number of steps. Scores decrease as the agent uses more steps than optimal.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run | AgentRun | The agent run to evaluate. | required |
| optimal_steps | int | The minimum number of steps needed. | required |
| threshold | float | Efficiency ratio threshold for pass/fail (default 0.5). | 0.5 |

Returns:

| Type | Description |
|---|---|
| Score | Score with efficiency ratio. |
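The ratio can be sketched as follows; capping at 1.0 when the agent beats the stated optimum is an assumption, not documented behavior:

```python
def efficiency(actual_steps: int, optimal_steps: int) -> float:
    """optimal/actual ratio, capped at 1.0 (capping is an assumption)."""
    if actual_steps <= 0:
        return 0.0
    return min(1.0, optimal_steps / actual_steps)
```

So an agent that takes twice the optimal number of steps scores 0.5, exactly the default threshold.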
trajectory_match(run, *, expected_trajectory, mode='ordered', threshold=1.0)

Score whether the agent's tool call sequence matches expected.

Supports three matching modes:

- "strict": Exact sequence match (same tools in same order, same count)
- "ordered": Expected tools appear in order (allows extra tools between)
- "unordered": All expected tools were called (any order)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| run | AgentRun | The agent run to evaluate. | required |
| expected_trajectory | list[str] | Ordered list of expected tool names. | required |
| mode | str | Matching mode: "strict", "ordered", or "unordered". | 'ordered' |
| threshold | float | Score threshold for pass/fail (default 1.0). | 1.0 |

Returns:

| Type | Description |
|---|---|
| Score | Score with match value. |

Raises:

| Type | Description |
|---|---|
| ValueError | If mode is not one of "strict", "ordered", "unordered". |
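The default "ordered" mode is a subsequence check; a pure-Python sketch (tool names are illustrative):

```python
def ordered_match(actual: list[str], expected: list[str]) -> bool:
    """True if `expected` appears as a subsequence of `actual`:
    same relative order, extra tools in between are allowed."""
    it = iter(actual)
    # `tool in it` advances the iterator, so order is enforced.
    return all(tool in it for tool in expected)

actual = ["lookup_order", "search_web", "check_return_policy", "initiate_refund"]
in_order = ordered_match(actual, ["lookup_order", "initiate_refund"])
out_of_order = ordered_match(actual, ["initiate_refund", "lookup_order"])
```

"strict" would instead compare the lists for equality, and "unordered" would compare them as sets.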
Datasets¶
GoldenDataset¶
GoldenDataset

EvalCase¶
EvalCase

Bases: BaseModel

A single test case in a golden dataset.

Example (YAML):

```yaml
id: refund-001
input: "I want to return order #12345"
expected_tools: [lookup_order, check_return_policy, initiate_refund]
expected_output_contains: ["refund initiated", "3-5 business days"]
max_steps: 8
tags: [refund, happy-path]
```
load_dataset¶
load_dataset(path)

Load and validate a golden dataset from a file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to a JSON or YAML file containing test cases. | required |

Returns:

| Type | Description |
|---|---|
| GoldenDataset | A validated GoldenDataset instance. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If the file does not exist. |
| ValueError | If the file format is unsupported or data is invalid. |
| ValidationError | If test cases fail schema validation. |
load_cases¶
load_cases(path, tags=None)

Load test cases from a file, optionally filtered by tags.

Convenience function that returns just the list of EvalCase objects.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| Path | Path to a JSON or YAML file. | required |
| tags | list[str] \| None | If provided, only return cases matching any of these tags. | None |

Returns:

| Type | Description |
|---|---|
| list[EvalCase] | List of validated EvalCase instances. |
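A self-contained sketch of the dataset format and the any-tag filter, using JSON (stdlib only) and plain dicts as hypothetical stand-ins for EvalCase:

```python
import json
import pathlib
import tempfile

cases = [
    {"id": "refund-001", "input": "I want to return order #12345",
     "tags": ["refund", "happy-path"]},
    {"id": "search-001", "input": "Find my last order",
     "tags": ["search"]},
]
path = pathlib.Path(tempfile.mkdtemp()) / "golden.json"
path.write_text(json.dumps(cases))

def load_filtered(path, tags=None):
    """Load cases, keeping those that share at least one tag (any-match)."""
    loaded = json.loads(pathlib.Path(path).read_text())
    if tags is None:
        return loaded
    return [c for c in loaded if set(c.get("tags", [])) & set(tags)]

refund_cases = load_filtered(path, tags=["refund"])
```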
Cost Tracking¶
CostTracker¶
CostTracker(budget=None, pricing_overrides=None, default_pricing=None)

Tracks cumulative costs across multiple agent runs with budget enforcement.

Usage:

```python
tracker = CostTracker(budget=config.budget, pricing_overrides=...)

# After each test run:
breakdown = tracker.record(run)

# Check budget:
tracker.check_budget()  # raises BudgetExceededError if over
```

record(run)

Calculate cost for a run and add it to the cumulative tracker.

check_test_budget(breakdown)

Check if a single test's cost exceeds the per-test budget. Raises BudgetExceededError if over the limit.

check_suite_budget()

Check if cumulative cost exceeds the per-suite budget. Raises BudgetExceededError if over the limit.

check_ci_budget()

Check if cumulative cost exceeds the per-CI-run budget. Raises BudgetExceededError if over the limit.

summary()

Generate a summary report of all tracked runs.
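A hypothetical pure-Python sketch of the accumulate-then-check pattern described above; the class, field names, and dollar amounts are illustrative, not the real API:

```python
class MiniTracker:
    """Toy cost tracker: accumulate per-run costs, enforce one budget."""

    def __init__(self, suite_budget_usd: float):
        self.suite_budget_usd = suite_budget_usd
        self.total_cost = 0.0
        self.run_count = 0

    def record(self, cost_usd: float) -> float:
        self.total_cost += cost_usd
        self.run_count += 1
        return cost_usd

    def check_suite_budget(self) -> None:
        if self.total_cost > self.suite_budget_usd:
            raise RuntimeError(
                f"suite budget ${self.suite_budget_usd:.2f} exceeded: "
                f"${self.total_cost:.4f}")

tracker = MiniTracker(suite_budget_usd=0.05)
tracker.record(0.03)
tracker.check_suite_budget()      # under budget: no error
tracker.record(0.03)
try:
    tracker.check_suite_budget()  # now over budget
    exceeded = False
except RuntimeError:
    exceeded = True
```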
CostReport¶
CostReport(run_count=0, total_input_tokens=0, total_output_tokens=0, total_cost=0.0, per_model=dict(), unpriced_steps=0, budget=BudgetConfig())

dataclass

Summary cost report across multiple runs.

budget_utilization()

Return budget utilization as fractions (0.0-1.0+) for each limit.
CostBreakdown¶
CostBreakdown(total_input_tokens=0, total_output_tokens=0, total_cost=0.0, per_model=dict(), unpriced_steps=0)

dataclass

Full cost breakdown for a single agent run.

calculate_run_cost¶
calculate_run_cost(run, pricing_overrides=None, default_pricing=None)

Calculate cost breakdown for an entire agent run.

Pricing resolution order per step:

1. Step's model in pricing_overrides
2. Step's model in BUILTIN_PRICING
3. default_pricing (fallback)

If no pricing is found for a step, it is counted as unpriced.
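The resolution order can be sketched directly; the table contents and prices below are illustrative, not the library's actual BUILTIN_PRICING:

```python
# Illustrative built-in table: USD per 1M tokens (not the real values).
BUILTIN_PRICING = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def resolve_pricing(model, overrides=None, default=None):
    """Mirror the documented order: overrides, then built-ins, then the
    default. Returning None means the step is counted as unpriced."""
    if overrides and model in overrides:
        return overrides[model]
    if model in BUILTIN_PRICING:
        return BUILTIN_PRICING[model]
    return default
```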
BudgetExceededError¶
BudgetExceededError(limit_name, limit_usd, actual_usd)
Bases: Exception
Raised when a cost budget limit is exceeded.