Judge

LLM-as-judge evaluation with rubrics and statistical assertions.

RubricJudge

RubricJudge(rubric, llm, model_name='')

Bases: Judge

Evaluates agent runs against a rubric using an LLM.

The LLM callable is a simple async function that takes (system, user) prompts and returns a string. This keeps the judge decoupled from any specific SDK.

Example::

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def call_llm(system: str, user: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content

judge = RubricJudge(rubric=my_rubric, llm=call_llm, model_name="gpt-4o")
score = await judge.evaluate(run)

evaluate(run, **kwargs) async

Evaluate an agent run against the rubric.
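Because the judge only depends on the `(system, user) -> str` callable contract, it can be exercised without a real API. A minimal sketch (the stub name and canned verdict are hypothetical, not part of the library):

```python
import asyncio

# Hypothetical stub LLM matching the (system, user) -> str contract.
# Useful for testing a judge without calling a real model provider.
async def stub_llm(system: str, user: str) -> str:
    # Return a canned verdict; a real judge would parse this response.
    return '{"empathy": 4, "accuracy": "pass"}'

# judge = RubricJudge(rubric=my_rubric, llm=stub_llm, model_name="stub")
# score = await judge.evaluate(run)
print(asyncio.run(stub_llm("You are a judge.", "Evaluate this run.")))
```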

Criterion

Criterion

Bases: BaseModel

A single evaluation criterion within a rubric.

Examples::

Criterion(name="empathy", description="Acknowledged frustration?",
          scale_type=ScaleType.NUMERIC, scale=[1, 2, 3, 4, 5])

Criterion(name="accuracy", description="Factually correct?",
          scale_type=ScaleType.BINARY, scale=["pass", "fail"])

max_value property

Maximum numeric score for this criterion.

min_value property

Minimum numeric score for this criterion.
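Presumably these properties read the ends of a numeric scale, which lets a judge normalize a raw criterion score into [0, 1]. A minimal sketch of that arithmetic (not the library's implementation):

```python
# Assumed: min_value/max_value are the extremes of the numeric scale.
scale = [1, 2, 3, 4, 5]
min_value, max_value = min(scale), max(scale)

def normalize(raw: float) -> float:
    # Map a raw score on the scale into the [0, 1] range.
    return (raw - min_value) / (max_value - min_value)

print(normalize(4))  # 0.75
```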

model_post_init(__context)

Set a sensible default scale when scale_type is BINARY but the scale was left at the default numeric scale.
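The idea can be sketched in plain Python (the enum values and the default numeric scale below are assumptions, not the library's actual definitions):

```python
from enum import Enum

class ScaleType(str, Enum):  # mirrors the library's enum; values assumed
    NUMERIC = "numeric"
    BINARY = "binary"

DEFAULT_NUMERIC_SCALE = [1, 2, 3, 4, 5]  # assumed default scale

def resolve_scale(scale_type: ScaleType, scale: list) -> list:
    # If BINARY was requested but the numeric default was left in place,
    # substitute a pass/fail scale instead.
    if scale_type is ScaleType.BINARY and scale == DEFAULT_NUMERIC_SCALE:
        return ["pass", "fail"]
    return scale

print(resolve_scale(ScaleType.BINARY, [1, 2, 3, 4, 5]))  # ['pass', 'fail']
```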

JudgeScore

JudgeScore

Bases: BaseModel

Complete judge output for one evaluation trial.

passed property

Whether the overall score meets or exceeds 0.5 (F-058).

score_for(criterion_name)

Look up a criterion score by name.
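The pass rule and the lookup can be sketched together with a minimal JudgeScore-like object (field names and shapes here are assumptions, not the library's actual model):

```python
from dataclasses import dataclass, field

@dataclass
class SketchJudgeScore:
    overall: float                        # normalized overall score in [0, 1]
    criterion_scores: dict = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # F-058: a trial passes when the overall score is at least 0.5.
        return self.overall >= 0.5

    def score_for(self, criterion_name: str):
        # Look up a per-criterion score by name.
        return self.criterion_scores[criterion_name]

s = SketchJudgeScore(overall=0.75, criterion_scores={"empathy": 4})
print(s.passed, s.score_for("empathy"))  # True 4
```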