Testing Layers¶
CheckAgent organizes tests into four layers, each with different cost, speed, and confidence tradeoffs.
Layer 1: Mock¶
Cost: Free | Speed: Milliseconds | When: Every commit
Replace LLMs and tools with deterministic mocks. Test your agent's logic without making any API calls.
```python
import pytest
from checkagent import assert_tool_called


@pytest.mark.agent_test(layer="mock")
async def test_agent_uses_correct_tool(my_agent, ca_mock_llm, ca_mock_tool):
    ca_mock_llm.on_input(contains="schedule").respond("I'll create an event.")
    ca_mock_tool.on_call("create_event").respond({"id": "evt-1"})

    result = await my_agent("Schedule a meeting", llm=ca_mock_llm, tools=ca_mock_tool)

    assert_tool_called(result, "create_event")
```
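Under the hood, a mock like `ca_mock_llm` is essentially a registry of input matchers mapped to canned responses. A minimal sketch of the idea (the `SimpleMockLLM` class and its `complete` method are hypothetical illustrations, not CheckAgent's actual implementation):

```python
class SimpleMockLLM:
    """Toy stand-in for an LLM: matches substrings, returns canned text."""

    def __init__(self):
        self._rules = []  # list of (substring, response) pairs

    def on_input(self, contains):
        # Return a tiny builder so calls read: mock.on_input(...).respond(...)
        class _Rule:
            def __init__(self, outer, substring):
                self._outer, self._substring = outer, substring

            def respond(self, text):
                self._outer._rules.append((self._substring, text))

        return _Rule(self, contains)

    def complete(self, prompt):
        # First matching rule wins; no rule means an empty response.
        for substring, response in self._rules:
            if substring.lower() in prompt.lower():
                return response
        return ""


mock = SimpleMockLLM()
mock.on_input(contains="schedule").respond("I'll create an event.")
print(mock.complete("Please schedule a meeting"))  # → I'll create an event.
```

Because nothing here touches a network, a whole suite of these runs in milliseconds and behaves identically on every machine.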
Layer 2: Replay¶
Cost: Free (after recording) | Speed: Seconds | When: Every PR
Record real API responses as JSON cassettes, then replay them for deterministic regression tests.
```python
import pytest


@pytest.mark.agent_test(layer="replay")
@pytest.mark.cassette("tests/cassettes/booking_flow.json")
async def test_booking_regression(my_agent):
    result = await my_agent.run("Book a flight to Tokyo")

    assert result.succeeded
    assert result.tool_was_called("search_flights")
```
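A cassette is just recorded request/response pairs persisted to disk. The exact schema is CheckAgent's concern, but conceptually it holds something like the structure below (the field names are illustrative, not the real cassette format):

```python
import json

# Illustrative cassette contents -- the real CheckAgent schema may differ.
cassette = {
    "interactions": [
        {
            "request": {"prompt": "Book a flight to Tokyo"},
            "response": {
                "text": "Searching flights...",
                "tool_calls": ["search_flights"],
            },
        }
    ]
}

# On replay, each outgoing request is looked up in the cassette and the
# recorded response is returned instead of calling the real API.
restored = json.loads(json.dumps(cassette, indent=2))
print(restored["interactions"][0]["response"]["tool_calls"][0])  # → search_flights
```

Because the cassette is plain JSON, it diffs cleanly in code review: a changed recording shows you exactly which API behavior moved.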
Layer 3: Eval¶
Cost: Moderate | Speed: Seconds | When: On merge
Evaluate agent quality against golden datasets using metrics like task completion, tool correctness, and step efficiency.
```python
import pytest

from checkagent import load_dataset, task_completion


@pytest.mark.agent_test(layer="eval")
async def test_agent_quality(my_agent):
    dataset = load_dataset("tests/golden/booking_cases.json")
    for case in dataset.cases:
        result = await my_agent.run(case.input)
        score = task_completion(result, case.expected_output)
        assert score.value >= 0.8, f"Failed on: {case.input}"
```
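Metrics like `task_completion` reduce a run to a score in [0, 1] that can be asserted against a threshold. As a rough illustration only (this toy keyword-overlap metric is not CheckAgent's implementation), a completion score might compare the agent's output against the expected one:

```python
def toy_task_completion(actual: str, expected: str) -> float:
    """Fraction of expected keywords present in the actual output (toy metric)."""
    expected_words = set(expected.lower().split())
    if not expected_words:
        return 1.0
    actual_words = set(actual.lower().split())
    return len(expected_words & actual_words) / len(expected_words)


score = toy_task_completion(
    actual="Booked flight JL123 to Tokyo on Friday",
    expected="flight to Tokyo",
)
print(score)  # → 1.0
```

The point of a threshold like `>= 0.8` rather than exact-match is that agents phrase things differently run to run; the metric absorbs harmless variation while still failing on genuinely wrong answers.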
Layer 4: Judge¶
Cost: Expensive (LLM calls) | Speed: Minutes | When: Nightly
Use an LLM as a judge to evaluate subjective quality — helpfulness, accuracy, safety — with statistical assertions.
```python
import pytest

from checkagent import RubricJudge, Criterion


@pytest.mark.agent_test(layer="judge")
async def test_response_quality(my_agent, judge_llm):
    judge = RubricJudge(
        llm=judge_llm,
        criteria=[
            Criterion(name="helpfulness", description="Was the response helpful?"),
            Criterion(name="accuracy", description="Was the information correct?"),
        ],
    )

    result = await my_agent.run("What's the capital of France?")
    score = await judge.evaluate(result)

    assert score.passed
```
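Because judge verdicts are themselves LLM outputs, a single-sample assertion is noisy. The "statistical assertions" idea is to collect several verdicts and assert on the aggregate instead. A minimal sketch with a stubbed judge (CheckAgent may ship helpers for this; `pass_rate` and `stub_judge` here are made-up names):

```python
import asyncio


async def stub_judge(response: str) -> bool:
    """Stand-in for judge.evaluate(); a real judge would call an LLM."""
    return "Paris" in response


async def pass_rate(responses, judge, trials_per_response=1):
    # Aggregate boolean verdicts across responses (and repeated trials,
    # since a real LLM judge may disagree with itself between calls).
    verdicts = []
    for response in responses:
        for _ in range(trials_per_response):
            verdicts.append(await judge(response))
    return sum(verdicts) / len(verdicts)


responses = ["The capital of France is Paris.", "Paris.", "I don't know."]
rate = asyncio.run(pass_rate(responses, stub_judge))
assert rate >= 0.6, f"Judge pass rate too low: {rate:.0%}"
```

Asserting on a pass rate rather than a single verdict keeps nightly runs from flaking on one bad sample, at the cost of more judge calls.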
Choosing a Layer¶
| Question | Layer |
|---|---|
| Does my agent call the right tools? | Mock |
| Did a code change break existing behavior? | Replay |
| Is my agent good enough on a benchmark? | Eval |
| Is the output actually helpful/correct/safe? | Judge |
Start with mock tests. They're free, fast, and catch most logic bugs. Add higher layers as your agent matures.
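In CI you typically want different triggers to run different layers (mock on every commit, replay on PRs, and so on). If CheckAgent does not already ship a selector, plain pytest hooks can filter on the marker's `layer` argument. A sketch (the `--agent-layer` option name is invented for this example):

```python
# conftest.py -- run e.g. `pytest --agent-layer=mock` to select one layer.
import pytest


def layer_of(item):
    """Read the layer kwarg off the agent_test marker, if present."""
    marker = item.get_closest_marker("agent_test")
    return marker.kwargs.get("layer") if marker else None


def pytest_addoption(parser):
    parser.addoption("--agent-layer", default=None, help="Only run this test layer")


def pytest_collection_modifyitems(config, items):
    wanted = config.getoption("--agent-layer")
    if wanted is None:
        return  # no filter requested: run everything
    skip = pytest.mark.skip(reason=f"not in layer {wanted!r}")
    for item in items:
        if layer_of(item) != wanted:
            item.add_marker(skip)
```

Skipping (rather than deselecting) keeps the filtered tests visible in the report, which makes it obvious when a CI job is running fewer layers than you expected.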