Code evaluators are simply functions that take a dataset example and the corresponding application output, and return one or more metrics. These functions can be passed directly into evaluate() / aevaluate().

Basic example

from langsmith import evaluate

def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Check if the answer exactly matches the expected answer."""
    return outputs["answer"] == reference_outputs["answer"]

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct]
)
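
If your target or evaluators are async, aevaluate() works the same way. A minimal sketch, assuming the same "dataset_name" placeholder and the correct evaluator defined above:

from langsmith import aevaluate

async def dummy_async_app(inputs: dict) -> dict:
    # Async targets are awaited internally by aevaluate().
    return {"answer": "hmm i'm not sure"}

async def run_eval():
    # aevaluate() returns a coroutine, so it must be awaited.
    return await aevaluate(
        dummy_async_app,
        data="dataset_name",
        evaluators=[correct],
    )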

Evaluator args

Code evaluator functions must use specific argument names. They can take any subset of the following:
  • run: Run: the full Run object generated by the application on the given example.
  • example: Example: the full dataset Example, including the example inputs, outputs (if available), and metadata (if available).
  • inputs: dict: a dictionary of the inputs corresponding to a single example in the dataset.
  • outputs: dict: a dictionary of the outputs generated by the application on the given inputs.
  • reference_outputs / referenceOutputs: dict: a dictionary of the reference outputs associated with the example, if any.
For most use cases you'll only need inputs, outputs, and reference_outputs; run and example are only needed when you want additional trace information or example metadata. When using JS/TS these arguments are all passed in as part of a single object argument.
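
As a sketch of how this name-based argument injection works, here are two evaluators declaring different subsets of the arguments above. The "max_chars" metadata key is hypothetical, not something LangSmith defines:

from langsmith.schemas import Example, Run

def has_answer(outputs: dict) -> bool:
    """Only needs the application outputs: check that an answer was produced."""
    return bool(outputs.get("answer"))

def within_length_budget(run: Run, example: Example) -> bool:
    """Uses the full Run and Example to compare the answer length against a
    hypothetical 'max_chars' key in the example metadata."""
    limit = (example.metadata or {}).get("max_chars", 1000)
    return len((run.outputs or {}).get("answer", "")) <= limit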

Evaluator output

Code evaluators are expected to return one of the following types:

Python and JS/TS
  • dict: a dict of the form {"score" | "value": ..., "key": ...} lets you customize both the metric type ("score" for numeric metrics, "value" for categorical ones) and the metric name. This is useful if, for example, you want to log an integer as a categorical metric.

Currently Python only
  • int | float | bool: interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the metric name.
  • str: interpreted as a categorical metric. The function name is used as the metric name.
  • list[dict]: return multiple metrics from a single function.
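
To make these return types concrete, a short sketch (the metric names here are illustrative only):

def answer_length(outputs: dict) -> int:
    """Numeric metric; the function name becomes the metric name."""
    return len(outputs["answer"])

def tone(outputs: dict) -> dict:
    """Categorical metric with an explicit name, logged via "value"."""
    value = "hedged" if "not sure" in outputs["answer"] else "confident"
    return {"key": "tone", "value": value}

def multiple_metrics(outputs: dict, reference_outputs: dict) -> list[dict]:
    """Return several metrics from a single evaluator."""
    return [
        {"key": "exact_match", "score": outputs["answer"] == reference_outputs["answer"]},
        {"key": "num_chars", "score": len(outputs["answer"])},
    ]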

Other examples

Requires langsmith>=0.2.0
from langsmith import evaluate, wrappers
from langsmith.schemas import Run, Example
from openai import AsyncOpenAI
# Assumes you've installed pydantic.
from pydantic import BaseModel

# We can still pass in Run and Example objects if we'd like
def correct_old_signature(run: Run, example: Example) -> dict:
    """Check if the answer exactly matches the expected answer."""
    return {"key": "correct", "score": run.outputs["answer"] == example.outputs["answer"]}

# Just evaluate actual outputs
def concision(outputs: dict) -> int:
    """Score how concise the answer is. 1 is the most concise, 5 is the least concise."""
    return min(len(outputs["answer"]) // 1000, 4) + 1

# Use an LLM-as-a-judge
oai_client = wrappers.wrap_openai(AsyncOpenAI())

async def valid_reasoning(inputs: dict, outputs: dict) -> bool:
    """Use an LLM to judge if the reasoning and the answer are consistent."""
    instructions = """
Given the following question, answer, and reasoning, determine if the reasoning for the
answer is logically valid and consistent with question and the answer."""

    class Response(BaseModel):
        reasoning_is_valid: bool

    msg = f"Question: {inputs['question']}\nAnswer: {outputs['answer']}\nReasoning: {outputs['reasoning']}"
    response = await oai_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": instructions,}, {"role": "user", "content": msg}],
        response_format=Response
    )
    return response.choices[0].message.parsed.reasoning_is_valid

def dummy_app(inputs: dict) -> dict:
    return {"answer": "hmm i'm not sure", "reasoning": "i didn't understand the question"}

results = evaluate(
    dummy_app,
    data="dataset_name",
    evaluators=[correct_old_signature, concision, valid_reasoning]
)
