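LangSmith integrates with the open-source openevals package to provide a suite of prebuilt evaluators that you can use as a starting point for evaluation.
This how-to guide demonstrates how to set up and run one type of evaluator (LLM-as-a-judge). For a complete list of prebuilt evaluators with usage examples, see the openevals and agentevals repos.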

Setup

You need to install the openevals package to use the prebuilt LLM-as-a-judge evaluator.
pip install -U openevals
You also need to set your OpenAI API key as an environment variable, though you can choose a different provider:
export OPENAI_API_KEY="your_openai_api_key"
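We will also use LangSmith's pytest integration for Python and Vitest/Jest for TypeScript to run the evals. openevals also integrates seamlessly with the evaluate method. For setup instructions, see the corresponding guides.

Running an evaluator

The general flow is simple: import the evaluator or factory function from openevals, then run it within your test file with inputs, outputs, and reference outputs. LangSmith automatically logs the evaluator's results as feedback.

Note that not all evaluators require every parameter (for example, the exact match evaluator requires only outputs and reference outputs). Additionally, if your LLM-as-a-judge prompt requires additional variables, passing them in as kwargs formats them into the prompt.

Set up your test file like this: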
import pytest
from langsmith import testing as t
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    feedback_key="correctness",
    model="openai:o3-mini",
)

# Mock stand-in for your application; it receives the question as a string
def my_llm_app(inputs: str) -> str:
    return "Doodads have increased in price by 10% in the past year."

@pytest.mark.langsmith
def test_correctness():
    inputs = "How much has the price of doodads changed in the past year?"
    reference_outputs = "The price of doodads has decreased by 50% in the past year."
    outputs = my_llm_app(inputs)

    t.log_inputs({"question": inputs})
    t.log_outputs({"answer": outputs})
    t.log_reference_outputs({"answer": reference_outputs})

    correctness_evaluator(
        inputs=inputs,
        outputs=outputs,
        reference_outputs=reference_outputs
    )
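The earlier note about extra prompt variables can be illustrated with plain Python string formatting. This is a minimal sketch of the idea only: it does not call openevals, and the template text, the `context` variable, and the `render_prompt` helper are hypothetical names chosen for illustration.

```python
# Hypothetical judge prompt with one extra variable, {context}, in addition to
# the inputs / outputs / reference_outputs placeholders.
CUSTOM_PROMPT = (
    "Use the context to judge whether the answer is correct.\n"
    "Context: {context}\n"
    "Question: {inputs}\n"
    "Answer: {outputs}\n"
    "Reference answer: {reference_outputs}\n"
)


def render_prompt(template: str, **variables: str) -> str:
    """Fill every {placeholder} in the template from keyword arguments."""
    return template.format(**variables)


rendered = render_prompt(
    CUSTOM_PROMPT,
    inputs="How much has the price of doodads changed in the past year?",
    outputs="Doodads have increased in price by 10% in the past year.",
    reference_outputs="The price of doodads has decreased by 50% in the past year.",
    context="Doodad prices are tracked in the annual pricing report.",
)
print(rendered)
```

With an openevals evaluator, the analogous step would be to pass `context=...` alongside `inputs`, `outputs`, and `reference_outputs` when calling it, assuming the prompt contains a matching `{context}` placeholder.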
The feedback_key/feedbackKey parameter is used as the name of the feedback in your experiment.

Running the eval in your terminal will result in something like the following:

[Image: Prebuilt evaluator terminal result]

You can also pass prebuilt evaluators directly into the evaluate method if you have already created a dataset in LangSmith. If using Python, this requires langsmith>=0.3.11:
from langsmith import Client
from openevals.llm import create_llm_as_judge
from openevals.prompts import CONCISENESS_PROMPT

client = Client()
conciseness_evaluator = create_llm_as_judge(
    prompt=CONCISENESS_PROMPT,
    feedback_key="conciseness",
    model="openai:o3-mini",
)

experiment_results = client.evaluate(
    # This is a dummy target function, replace with your actual LLM-based system
    lambda inputs: "What color is the sky?",
    data="Sample dataset",
    evaluators=[
        conciseness_evaluator
    ]
)
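As noted for the exact match evaluator, not every evaluator needs all three arguments. The sketch below is a hand-written stand-in, not the openevals implementation, showing an evaluator that takes only outputs and reference outputs. The function name `exact_match_sketch` and the `key`/`score` shape of the returned dictionary are assumptions made for illustration.

```python
from typing import Any


def exact_match_sketch(*, outputs: Any, reference_outputs: Any) -> dict:
    """Compare outputs to reference_outputs; no inputs argument is needed."""
    # "key" plays the role of the feedback name, "score" is the verdict.
    return {"key": "exact_match", "score": outputs == reference_outputs}


matching = exact_match_sketch(
    outputs={"answer": "42"},
    reference_outputs={"answer": "42"},
)
mismatched = exact_match_sketch(
    outputs="Doodads have increased in price by 10% in the past year.",
    reference_outputs="The price of doodads has decreased by 50% in the past year.",
)
print(matching)
print(mismatched)
```

For real evaluations, prefer the prebuilt evaluators from openevals over a hand-rolled comparison like this one.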
For a complete list of available evaluators, see the openevals and agentevals repos.