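How to use prebuilt evaluators

LangSmith integrates with the open-source openevals package to provide a suite of prebuilt evaluators that you can use as starting points for evaluation.

This how-to guide demonstrates how to set up and run one type of evaluator (LLM-as-a-judge). For a complete list of prebuilt evaluators with usage examples, see the openevals and agentevals repos.

Setup

You'll need to install the openevals package to use the prebuilt LLM-as-a-judge evaluator.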

pip install -U openevals

You'll also need to set your OpenAI API key as an environment variable, though you can choose a different provider instead:

export OPENAI_API_KEY="your_openai_api_key"

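We'll also use LangSmith's pytest integration for Python and Vitest/Jest for TypeScript to run the evals. openevals also integrates seamlessly with the evaluate method. See the appropriate guides for setup instructions.

Running an evaluator

The general flow is simple: import the evaluator or factory function from openevals, then run it within your test file with inputs, outputs, and reference outputs. LangSmith automatically logs the evaluator's results as feedback. Note that not all evaluators require every parameter (the exact match evaluator, for example, requires only outputs and reference outputs). Additionally, if your LLM-as-a-judge prompt requires additional variables, passing them in as kwargs formats them into the prompt. Set up your test file like this: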

import pytest
from langsmith import testing as t
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    feedback_key="correctness",
    model="openai:o3-mini",
)

# Mock stand-in for your application; it receives the question as a string
def my_llm_app(inputs: str) -> str:
    return "Doodads have increased in price by 10% in the past year."

@pytest.mark.langsmith
def test_correctness():
    inputs = "How much has the price of doodads changed in the past year?"
    reference_outputs = "The price of doodads has decreased by 50% in the past year."
    outputs = my_llm_app(inputs)

    t.log_inputs({"question": inputs})
    t.log_outputs({"answer": outputs})
    t.log_reference_outputs({"answer": reference_outputs})

    correctness_evaluator(
        inputs=inputs,
        outputs=outputs,
        reference_outputs=reference_outputs
    )
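
The note above about extra prompt variables can be pictured with plain string formatting. The sketch below is only an illustration: CUSTOM_PROMPT, render_prompt, and the tone variable are hypothetical names, and openevals may fill its prompts differently internally.

```python
# Hypothetical prompt template; `tone` is an extra variable beyond the
# standard inputs / outputs / reference_outputs placeholders.
CUSTOM_PROMPT = (
    "Question: {inputs}\n"
    "Answer: {outputs}\n"
    "Reference answer: {reference_outputs}\n"
    "Grade the answer for correctness, expecting a {tone} tone."
)


def render_prompt(template: str, **variables: str) -> str:
    """Fill every {placeholder} in the template from keyword arguments."""
    return template.format(**variables)


rendered = render_prompt(
    CUSTOM_PROMPT,
    inputs="How much has the price of doodads changed in the past year?",
    outputs="Doodads have increased in price by 10% in the past year.",
    reference_outputs="The price of doodads has decreased by 50% in the past year.",
    tone="formal",
)
print(rendered)
```

Under this assumption, an evaluator built from such a prompt would be called with tone="formal" as an extra keyword argument alongside inputs, outputs, and reference_outputs.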

The feedback_key parameter (feedbackKey in TypeScript) is used as the name of the feedback in your experiment. Running the eval in your terminal prints a summary of the test results, including this feedback.
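
To make the role of the feedback key concrete, and to show why an exact-match style evaluator needs only outputs and reference outputs, here is a minimal stand-in. It is not the openevals implementation: simple_exact_match and the key/score result shape are assumptions made for illustration.

```python
def simple_exact_match(outputs, reference_outputs, feedback_key="exact_match"):
    """Hypothetical stand-in evaluator: passes only when both values are equal."""
    # The feedback key names the result; the score says whether it passed.
    return {"key": feedback_key, "score": outputs == reference_outputs}


result = simple_exact_match(
    outputs="Doodads have increased in price by 10% in the past year.",
    reference_outputs="The price of doodads has decreased by 50% in the past year.",
)
print(result)
```

No inputs argument is involved, because the comparison depends only on what the application produced and what was expected.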

You can also pass prebuilt evaluators directly into the evaluate method if you have already created a dataset in LangSmith. If using Python, this requires langsmith>=0.3.11:

from langsmith import Client
from openevals.llm import create_llm_as_judge
from openevals.prompts import CONCISENESS_PROMPT

client = Client()
conciseness_evaluator = create_llm_as_judge(
    prompt=CONCISENESS_PROMPT,
    feedback_key="conciseness",
    model="openai:o3-mini",
)

experiment_results = client.evaluate(
    # This is a dummy target function, replace with your actual LLM-based system
    lambda inputs: "What color is the sky?",
    data="Sample dataset",
    evaluators=[
        conciseness_evaluator
    ]
)
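
When the dummy lambda above is replaced with a real application, a named target function is often easier to read. A minimal sketch, assuming each dataset example has a question input field (my_target and both field names are hypothetical):

```python
def my_target(inputs: dict) -> dict:
    """Hypothetical target: receives one example's inputs, returns its outputs."""
    question = inputs["question"]  # assumed dataset field name
    # Replace this canned string with a call to your real application.
    return {"answer": f"Stub answer to: {question}"}


print(my_target({"question": "What color is the sky?"}))
```

A function like this could be passed as the first argument to client.evaluate in place of the lambda.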

For a complete list of available evaluators, see the openevals and agentevals repos.
