CheckAgent¶

The open-source testing framework for AI agents. pytest-native. Self-hosted. Zero telemetry. Apache-2.0 forever.

CheckAgent demo — run tests and safety scans in seconds

Try it now — no install

System Prompt Analyzer → — Paste your agent's system prompt and instantly see which OWASP LLM Top 10 security controls it addresses. Runs in the browser, no API keys needed.

CheckAgent is a pytest plugin for testing AI agent workflows. It provides layered testing — from free, millisecond unit tests to LLM-judged evaluations with statistical rigor — so you can ship agents with the same confidence you ship traditional software.

Why CheckAgent?¶

pytest-native — tests are .py files, assertions are assert, no new DSL to learn. If you know pytest, you already know CheckAgent.
Self-hosted — runs entirely on your infrastructure. No cloud accounts, no SaaS dependencies, no vendor lock-in.
Zero telemetry — nothing is phoned home, ever. The code is Apache-2.0 and fully auditable. Read it yourself.
Apache-2.0 forever — no open-core, no bait-and-switch, no "community edition" restrictions. The license is in the repo.
Framework-agnostic — works with LangChain, OpenAI Agents SDK, CrewAI, PydanticAI, Anthropic SDK, or any Python callable.
Safety testing built-in — 101 attack probes covering prompt injection, PII leakage, jailbreak attempts, tool scope violations, data enumeration, and groundedness ship as core, not an add-on.
Cost tracking built-in — every test run tracks token usage and estimated cost. Set budget limits so eval and judge layers don't surprise you.
Async-first — most agent frameworks are async; CheckAgent is too. async def test_* just works.

The Testing Pyramid¶

                  ╱‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾╲
                 │   JUDGE  · $$$     │          Minutes · Nightly
                 │   LLM-as-judge     │
                ╱‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾╲
               │   EVAL  · $$          │         Seconds · On merge
               │   Metrics & datasets  │
              ╱‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾╲
             │   REPLAY  · $              │      Seconds · On PR
             │   Record & replay          │
            ╱‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾╲
           │   MOCK  · Free                  │   Milliseconds · Every commit
           │   Deterministic unit tests      │
            ╲_______________________________╱

Start with MOCK — free, milliseconds, deterministic, no API calls. Add REPLAY for record-and-replay regression tests that run in seconds. Add EVAL to measure quality metrics against golden datasets. Add JUDGE for LLM-as-judge assertions on subjective quality. Run each layer at the right frequency so you're not paying for LLM calls on every commit.

Four Layers, One Framework¶

Layer	Cost	Speed	When to run
MOCK	Free	Milliseconds	Every commit
REPLAY	Cheap	Seconds	Every PR
EVAL	Moderate	Seconds	On merge
JUDGE	Expensive	Minutes	Nightly

Safety Testing¶

CheckAgent ships 101 attack probes across six categories:

Prompt injection — direct and indirect injection via tool outputs, user messages, and system prompt overrides
PII leakage — tests whether your agent exposes names, emails, phone numbers, SSNs, and other sensitive data
Jailbreak — role-play attacks, encoding tricks, and instruction-override attempts
Scope violations — tool boundary tests checking whether your agent calls tools it shouldn't
Data enumeration — attempts to extract or enumerate data beyond authorized scope
Groundedness — tests whether your agent fabricates or hallucinates information not grounded in context

@pytest.mark.agent_test(layer="mock")
async def test_no_prompt_injection(my_agent, ca_safety):
    await ca_safety.assert_no_injection(my_agent, probe="ignore_previous_instructions")

Get Started in 60 Seconds¶

pip install checkagent
checkagent demo

No API keys, no configuration. The demo runs 8 tests across mock, eval, and safety layers and shows you what CheckAgent can do.

Ready to test your own agent? See the Quickstart.