Judge

LLM-as-judge evaluation with rubrics and statistical assertions.

RubricJudge

RubricJudge(rubric, llm, model_name='')

Bases: Judge

Evaluates agent runs against a rubric using an LLM.

The LLM callable is a simple async function that takes (system, user) prompts and returns a string. This keeps the judge decoupled from any specific SDK.

Example::

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def call_llm(system: str, user: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content

judge = RubricJudge(rubric=my_rubric, llm=call_llm, model_name="gpt-4o")
score = await judge.evaluate(run)

evaluate(run, **kwargs) async

Evaluate an agent run against the rubric.
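Because the judge only depends on the `(system, user) -> str` callable contract, it can be exercised without a real API. A minimal sketch (the stub name and canned verdict are hypothetical, not part of the library):

```python
import asyncio

# Hypothetical stub LLM matching the (system, user) -> str contract.
# Useful for testing a judge without calling a real model provider.
async def stub_llm(system: str, user: str) -> str:
    # Return a canned verdict; a real judge would parse this response.
    return '{"empathy": 4, "accuracy": "pass"}'

# judge = RubricJudge(rubric=my_rubric, llm=stub_llm, model_name="stub")
# score = await judge.evaluate(run)
print(asyncio.run(stub_llm("You are a judge.", "Evaluate this run.")))
```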

Criterion

Criterion

Bases: BaseModel

A single evaluation criterion within a rubric.

Examples::

Criterion(name="empathy", description="Acknowledged frustration?",
          scale_type=ScaleType.NUMERIC, scale=[1, 2, 3, 4, 5])

Criterion(name="accuracy", description="Factually correct?",
          scale_type=ScaleType.BINARY, scale=["pass", "fail"])

max_value property

Maximum numeric score for this criterion.

min_value property

Minimum numeric score for this criterion.
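Presumably these properties read the ends of a numeric scale, which lets a judge normalize a raw criterion score into [0, 1]. A minimal sketch of that arithmetic (not the library's implementation):

```python
# Assumed: min_value/max_value are the extremes of the numeric scale.
scale = [1, 2, 3, 4, 5]
min_value, max_value = min(scale), max(scale)

def normalize(raw: float) -> float:
    # Map a raw score on the scale into the [0, 1] range.
    return (raw - min_value) / (max_value - min_value)

print(normalize(4))  # 0.75
```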

model_post_init(__context)

Set a sensible default scale when scale_type is BINARY but the scale was left at the default numeric scale.
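The idea can be sketched in plain Python (the enum values and the default numeric scale below are assumptions, not the library's actual definitions):

```python
from enum import Enum

class ScaleType(str, Enum):  # mirrors the library's enum; values assumed
    NUMERIC = "numeric"
    BINARY = "binary"

DEFAULT_NUMERIC_SCALE = [1, 2, 3, 4, 5]  # assumed default scale

def resolve_scale(scale_type: ScaleType, scale: list) -> list:
    # If BINARY was requested but the numeric default was left in place,
    # substitute a pass/fail scale instead.
    if scale_type is ScaleType.BINARY and scale == DEFAULT_NUMERIC_SCALE:
        return ["pass", "fail"]
    return scale

print(resolve_scale(ScaleType.BINARY, [1, 2, 3, 4, 5]))  # ['pass', 'fail']
```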

JudgeScore

JudgeScore

Bases: BaseModel

Complete judge output for one evaluation trial.

passed property

Whether the overall score meets or exceeds 0.5 (F-058).

score_for(criterion_name)

Look up a criterion score by name.
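The pass rule and the lookup can be sketched together with a minimal JudgeScore-like object (field names and shapes here are assumptions, not the library's actual model):

```python
from dataclasses import dataclass, field

@dataclass
class SketchJudgeScore:
    overall: float                        # normalized overall score in [0, 1]
    criterion_scores: dict = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # F-058: a trial passes when the overall score is at least 0.5.
        return self.overall >= 0.5

    def score_for(self, criterion_name: str):
        # Look up a per-criterion score by name.
        return self.criterion_scores[criterion_name]

s = SketchJudgeScore(overall=0.75, criterion_scores={"empathy": 4})
print(s.passed, s.score_for("empathy"))  # True 4
```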