Skip to content

Safety

Evaluators, attack probes, and OWASP LLM Top 10 taxonomy.

Evaluators

PromptInjectionDetector

PromptInjectionDetector()

Bases: SafetyEvaluator

Detect signs of prompt injection in agent output.

Scans agent output for patterns that indicate the agent has been manipulated by an injected instruction. Works on the output side — checking whether the agent complied with an injection, not whether the input contained one.

Custom patterns can be added via :meth:add_pattern.

add_pattern(pattern, description, severity=Severity.HIGH)

Register an additional detection pattern.

evaluate(text)

Scan text for prompt injection indicators.

PIILeakageScanner

PIILeakageScanner(*, disabled=None)

Bases: SafetyEvaluator

Detect PII in agent output.

By default detects emails, US phone numbers, SSNs, credit card numbers, and IPv4 addresses. Custom patterns can be added via :meth:add_pattern.

Specific built-in patterns can be disabled via the disabled set passed to the constructor.

add_pattern(name, pattern, severity=Severity.HIGH)

Register a custom PII detection pattern.

evaluate(text)

Scan text for PII.

SystemPromptLeakDetector

SystemPromptLeakDetector()

Bases: SafetyEvaluator

Detect system prompt leakage in agent output.

Checks whether the agent reveals its system prompt, instructions, or configuration when probed directly or indirectly.

For higher accuracy, you can provide known fragments of the actual system prompt via :meth:set_system_prompt — the evaluator will then check for verbatim leaks in addition to pattern matching.

set_system_prompt(prompt, *, min_fragment_len=20)

Provide the actual system prompt for verbatim leak detection.

The prompt is split into fragments of at least min_fragment_len characters. If any fragment appears in the agent output, it's flagged as a leak.

add_pattern(pattern, description, severity=Severity.HIGH)

Register an additional leak detection pattern.

evaluate(text)

Scan text for system prompt leakage.

ToolCallBoundaryValidator

ToolCallBoundaryValidator(boundary=None)

Bases: SafetyEvaluator

Validate that tool calls stay within declared permission boundaries.

Checks three types of boundaries:

  1. Tool allowlist/blocklist — only permitted tools may be called.
  2. Path boundaries — path-like arguments must be within allowed prefixes.
  3. Argument patterns — argument values must not match forbidden regexes.

Usage::

boundary = ToolBoundary(
    allowed_tools={"search", "read_file"},
    allowed_paths=["/home/user/project"],
    forbidden_argument_patterns={"command": r"rm\s+-rf"},
)
validator = ToolCallBoundaryValidator(boundary)
result = validator.evaluate_run(agent_run)
assert result.passed

evaluate(text)

Raise because tool boundary checks require structured tool call data.

Use :meth:evaluate_run with an :class:AgentRun instead.

evaluate_run(run)

Check all tool calls in run against the configured boundary.

RefusalComplianceChecker

RefusalComplianceChecker(*, expect_refusal=True)

Bases: SafetyEvaluator

Check that an agent correctly refuses out-of-scope or harmful requests.

Two modes of operation:

  1. Assert refusal (expect_refusal=True, the default) — the output must contain a refusal indicator. Useful when testing that the agent refuses a harmful or out-of-scope prompt.

  2. Assert no refusal (expect_refusal=False) — the output must not contain a refusal indicator. Useful when testing that the agent handles a legitimate request without over-refusing.

Custom refusal patterns can be added via :meth:add_pattern.

add_pattern(pattern, description)

Register an additional refusal detection pattern.

evaluate(text)

Check text for refusal compliance.

GroundednessEvaluator

GroundednessEvaluator(*, mode='fabrication', min_hedging_signals=1, min_disclaimer_signals=1)

Bases: SafetyEvaluator

Evaluate whether an agent's output is properly grounded.

Two modes:

  • "fabrication" — output should contain hedging or uncertainty markers when asked for specific unverifiable facts. Absence of hedging is flagged as a potential fabrication risk.

  • "uncertainty" — output should contain disclaimers when giving safety-critical advice (medical, financial, legal). Absence of disclaimers is flagged as overconfidence.

::

evaluator = GroundednessEvaluator(mode="fabrication")
result = evaluator.evaluate("NVIDIA stock is exactly $127.43")
assert not result.passed  # no hedging detected

evaluator = GroundednessEvaluator(mode="uncertainty")
result = evaluator.evaluate("Yes, combine those medications freely")
assert not result.passed  # no medical disclaimer detected

add_hedging_pattern(pattern, description)

Register an additional hedging/uncertainty pattern.

add_disclaimer_pattern(pattern, description)

Register an additional disclaimer pattern.

evaluate(text)

Evaluate text for groundedness issues.

Attack Probes

Probe

Probe(input, category, severity=Severity.HIGH, name='', description='', tags=frozenset()) dataclass

A single adversarial test input.

Attributes:

Name Type Description
input str

The adversarial text to send to the agent.

category SafetyCategory

The safety category this probe targets.

severity Severity

Expected severity if the agent fails this probe.

name str

Short human-readable label (used as pytest param ID).

description str

Longer explanation of what this probe tests.

tags frozenset[str]

Arbitrary tags for filtering (e.g. "encoding", "roleplay").

ProbeSet

ProbeSet(probes=None, *, name='')

A composable, iterable collection of :class:Probe instances.

Supports: - Iteration: for probe in probe_set - Concatenation: set_a + set_b - Filtering: probe_set.filter(tags={"encoding"}) - pytest parametrize: @pytest.mark.parametrize("attack", probe_set.all())

all()

Return a copy of this ProbeSet.

The returned ProbeSet is iterable, so it works directly with pytest.mark.parametrize and supports + composition::

combined = injection.direct.all() + jailbreak.all()

filter(*, tags=None, category=None, severity=None)

Return a new ProbeSet with only matching probes.

category and severity accept enum values or case-insensitive strings (e.g. "CRITICAL" or "critical").

Taxonomy

SafetyCategory

SafetyCategory

Bases: str, Enum

Safety test categories mapped to OWASP LLM Top 10.

Severity

Severity

Bases: str, Enum

Severity levels for safety findings.

Supports direct comparison operators (>=, <, etc.)::

Severity.HIGH >= Severity.MEDIUM  # True
Severity.LOW < Severity.CRITICAL  # True

Conversation Scanner

ConversationSafetyScanner(evaluators)

Scans multi-turn conversations with multiple safety evaluators.

Runs each evaluator in two modes:

  1. Per-turn — evaluates each turn's output individually.
  2. Aggregate — evaluates the concatenated output of all turns.

Findings that appear only in aggregate mode are flagged as aggregate_only_findings — these represent split/accumulation attacks invisible to per-turn scanning.

Usage::

scanner = ConversationSafetyScanner([
    PromptInjectionDetector(),
    PIILeakageScanner(),
])
result = scanner.scan(conversation)
assert result.passed
# Or check specifically for split attacks:
assert len(result.aggregate_only_findings) == 0

scan(conversation)

Scan the conversation with all registered evaluators.