Safety¶
Evaluators, attack probes, and OWASP LLM Top 10 taxonomy.
Evaluators¶
PromptInjectionDetector¶
PromptInjectionDetector()
¶
Bases: SafetyEvaluator
Detect signs of prompt injection in agent output.
Scans agent output for patterns that indicate the agent has been manipulated by an injected instruction. Works on the output side — checking whether the agent complied with an injection, not whether the input contained one.
Custom patterns can be added via :meth:add_pattern.
PIILeakageScanner¶
PIILeakageScanner(*, disabled=None)
¶
Bases: SafetyEvaluator
Detect PII in agent output.
By default detects emails, US phone numbers, SSNs, credit card
numbers, and IPv4 addresses. Custom patterns can be added via
:meth:add_pattern.
Specific built-in patterns can be disabled via the disabled set passed to the constructor.
SystemPromptLeakDetector¶
SystemPromptLeakDetector()
¶
Bases: SafetyEvaluator
Detect system prompt leakage in agent output.
Checks whether the agent reveals its system prompt, instructions, or configuration when probed directly or indirectly.
For higher accuracy, you can provide known fragments of the actual
system prompt via :meth:set_system_prompt — the evaluator will
then check for verbatim leaks in addition to pattern matching.
set_system_prompt(prompt, *, min_fragment_len=20)
¶
Provide the actual system prompt for verbatim leak detection.
The prompt is split into fragments of at least min_fragment_len characters. If any fragment appears in the agent output, it's flagged as a leak.
add_pattern(pattern, description, severity=Severity.HIGH)
¶
Register an additional leak detection pattern.
evaluate(text)
¶
Scan text for system prompt leakage.
ToolCallBoundaryValidator¶
ToolCallBoundaryValidator(boundary=None)
¶
Bases: SafetyEvaluator
Validate that tool calls stay within declared permission boundaries.
Checks three types of boundaries:
- Tool allowlist/blocklist — only permitted tools may be called.
- Path boundaries — path-like arguments must be within allowed prefixes.
- Argument patterns — argument values must not match forbidden regexes.
Usage::
boundary = ToolBoundary(
allowed_tools={"search", "read_file"},
allowed_paths=["/home/user/project"],
forbidden_argument_patterns={"command": r"rm\s+-rf"},
)
validator = ToolCallBoundaryValidator(boundary)
result = validator.evaluate_run(agent_run)
assert result.passed
RefusalComplianceChecker¶
RefusalComplianceChecker(*, expect_refusal=True)
¶
Bases: SafetyEvaluator
Check that an agent correctly refuses out-of-scope or harmful requests.
Two modes of operation:
-
Assert refusal (
expect_refusal=True, the default) — the output must contain a refusal indicator. Useful when testing that the agent refuses a harmful or out-of-scope prompt. -
Assert no refusal (
expect_refusal=False) — the output must not contain a refusal indicator. Useful when testing that the agent handles a legitimate request without over-refusing.
Custom refusal patterns can be added via :meth:add_pattern.
GroundednessEvaluator¶
GroundednessEvaluator(*, mode='fabrication', min_hedging_signals=1, min_disclaimer_signals=1)
¶
Bases: SafetyEvaluator
Evaluate whether an agent's output is properly grounded.
Two modes:
-
"fabrication"— output should contain hedging or uncertainty markers when asked for specific unverifiable facts. Absence of hedging is flagged as a potential fabrication risk. -
"uncertainty"— output should contain disclaimers when giving safety-critical advice (medical, financial, legal). Absence of disclaimers is flagged as overconfidence.
::
evaluator = GroundednessEvaluator(mode="fabrication")
result = evaluator.evaluate("NVIDIA stock is exactly $127.43")
assert not result.passed # no hedging detected
evaluator = GroundednessEvaluator(mode="uncertainty")
result = evaluator.evaluate("Yes, combine those medications freely")
assert not result.passed # no medical disclaimer detected
Attack Probes¶
Probe¶
Probe(input, category, severity=Severity.HIGH, name='', description='', tags=frozenset())
dataclass
¶
A single adversarial test input.
Attributes:
| Name | Type | Description |
|---|---|---|
input |
str
|
The adversarial text to send to the agent. |
category |
SafetyCategory
|
The safety category this probe targets. |
severity |
Severity
|
Expected severity if the agent fails this probe. |
name |
str
|
Short human-readable label (used as pytest param ID). |
description |
str
|
Longer explanation of what this probe tests. |
tags |
frozenset[str]
|
Arbitrary tags for filtering (e.g. |
ProbeSet¶
ProbeSet(probes=None, *, name='')
¶
A composable, iterable collection of :class:Probe instances.
Supports:
- Iteration: for probe in probe_set
- Concatenation: set_a + set_b
- Filtering: probe_set.filter(tags={"encoding"})
- pytest parametrize: @pytest.mark.parametrize("attack", probe_set.all())
all()
¶
Return a copy of this ProbeSet.
The returned ProbeSet is iterable, so it works directly
with pytest.mark.parametrize and supports + composition::
combined = injection.direct.all() + jailbreak.all()
filter(*, tags=None, category=None, severity=None)
¶
Return a new ProbeSet with only matching probes.
category and severity accept enum values or case-insensitive
strings (e.g. "CRITICAL" or "critical").
Taxonomy¶
SafetyCategory¶
SafetyCategory
¶
Bases: str, Enum
Safety test categories mapped to OWASP LLM Top 10.
Severity¶
Severity
¶
Bases: str, Enum
Severity levels for safety findings.
Supports direct comparison operators (>=, <, etc.)::
Severity.HIGH >= Severity.MEDIUM # True
Severity.LOW < Severity.CRITICAL # True
Conversation Scanner¶
ConversationSafetyScanner(evaluators)
¶
Scans multi-turn conversations with multiple safety evaluators.
Runs each evaluator in two modes:
- Per-turn — evaluates each turn's output individually.
- Aggregate — evaluates the concatenated output of all turns.
Findings that appear only in aggregate mode are flagged as
aggregate_only_findings — these represent split/accumulation
attacks invisible to per-turn scanning.
Usage::
scanner = ConversationSafetyScanner([
PromptInjectionDetector(),
PIILeakageScanner(),
])
result = scanner.scan(conversation)
assert result.passed
# Or check specifically for split attacks:
assert len(result.aggregate_only_findings) == 0
scan(conversation)
¶
Scan the conversation with all registered evaluators.