Judge¶
LLM-as-judge evaluation with rubrics and statistical assertions.
RubricJudge¶
RubricJudge(rubric, llm, model_name='')
Bases: Judge
Evaluates agent runs against a rubric using an LLM.
The LLM callable is a simple async function that takes (system, user) prompts and returns a string. This keeps the judge decoupled from any specific SDK.
Example:

```python
async def call_llm(system: str, user: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content

judge = RubricJudge(rubric=my_rubric, llm=call_llm, model_name="gpt-4o")
score = await judge.evaluate(run)
```
async evaluate(run, **kwargs)¶
Evaluate an agent run against the rubric.
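Because `evaluate` only depends on the `(system, user) -> str` callable described above, a deterministic stub LLM makes the judge testable offline. This is a minimal sketch; the JSON verdict shape and key names are assumptions for illustration, not the library's required response format:

```python
import asyncio
import json

# Hypothetical stub satisfying the (system, user) -> str protocol
# without calling a real model, so judge runs are reproducible.
async def stub_llm(system: str, user: str) -> str:
    # Canned verdict; this JSON shape is an assumption, not the
    # format RubricJudge actually expects from the LLM.
    return json.dumps({"empathy": 4, "accuracy": "pass"})

verdict = json.loads(asyncio.run(stub_llm("You are a judge.", "Rate this run.")))
print(verdict)
```

Passing a stub like this as the `llm` argument keeps unit tests fast and free of network calls.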
Criterion¶
Bases: BaseModel
A single evaluation criterion within a rubric.
Examples:

```python
Criterion(name="empathy", description="Acknowledged frustration?",
          scale_type=ScaleType.NUMERIC, scale=[1, 2, 3, 4, 5])

Criterion(name="accuracy", description="Factually correct?",
          scale_type=ScaleType.BINARY, scale=["pass", "fail"])
```
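For readers without the library installed, the documented fields can be mirrored in a dependency-free sketch. The real `Criterion` is a Pydantic `BaseModel`; the plain dataclass and the `ScaleType` member values below are stand-ins for illustration only:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical stand-in: the real ScaleType's member values may differ.
class ScaleType(Enum):
    NUMERIC = "numeric"
    BINARY = "binary"

# Hypothetical stand-in: the real Criterion is a Pydantic BaseModel
# with validation; a dataclass suffices to show the field layout.
@dataclass
class Criterion:
    name: str
    description: str
    scale_type: ScaleType
    scale: list

empathy = Criterion(name="empathy", description="Acknowledged frustration?",
                    scale_type=ScaleType.NUMERIC, scale=[1, 2, 3, 4, 5])
accuracy = Criterion(name="accuracy", description="Factually correct?",
                     scale_type=ScaleType.BINARY, scale=["pass", "fail"])
print(empathy.scale, accuracy.scale)
```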