Evaluations are a quantitative way to measure the performance of your LLM application. LLM behavior can be unpredictable: even small changes to a prompt, model, or input can significantly affect results. Evaluations give you a structured way to identify failures, compare versions, and build more reliable AI applications.

Running an evaluation in LangSmith requires three key components:
  • Dataset: A collection of test inputs (and, optionally, expected outputs).
  • Target function: The part of your application you want to test. This could be a single LLM call with a new prompt, one module, or your entire workflow.
  • Evaluators: Functions that score the outputs of your target function.
This quickstart walks you through running a starter evaluation that checks the correctness of LLM responses, using the LangSmith SDK or the UI.
If you'd rather get started by video, see the datasets and evaluations video guide.
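In SDK terms, the three components come together in a single evaluate() call. The sketch below shows only that shape, assuming a recent langsmith Python SDK (where Client.evaluate is available) and a placeholder dataset name; the steps in this guide build each piece out concretely:

```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

def target(inputs: dict) -> dict:
    # 2) The target function under test -- here just a stub answer.
    return {"answer": "stub answer"}

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    # 3) An evaluator scores each output against the reference output.
    return outputs["answer"] == reference_outputs["output"]

client.evaluate(
    target,                    # the target function
    data="my-dataset",         # 1) a dataset (placeholder name)
    evaluators=[exact_match],  # one or more evaluators
)
```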

Prerequisites

Before you begin, make sure you have a LangSmith account and an OpenAI API key. This guide follows the UI workflow; equivalent SDK sketches are included along the way.
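If you plan to follow along with the SDK, a typical local setup looks like the sketch below (install the packages first, e.g. pip install -U langsmith openai; all key values are placeholders):

```python
import os

# Placeholders -- substitute your real keys before running.
os.environ["LANGSMITH_API_KEY"] = "<your-langsmith-api-key>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
os.environ["LANGSMITH_TRACING"] = "true"  # optional: also trace the runs
```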

1. Set workspace secrets

In the LangSmith UI, make sure your OpenAI API key is set as a workspace secret:
  1. Navigate to Settings, then move to the Secrets tab.
  2. Select Add secret and enter OPENAI_API_KEY as the key, with your API key as the Value.
  3. Select Save secret.
When adding workspace secrets in the LangSmith UI, make sure the secret key matches the environment variable name the model provider expects.

2. Create a prompt

LangSmith’s Prompt Playground makes it possible to run evaluations over different prompts and new models, or to test different model configurations.
  1. In the LangSmith UI, navigate to the Playground under Prompt Engineering.
  2. Under the Prompts panel, modify the system prompt to:
    Answer the following question accurately:
    
    Leave the Human message as is: {question}.
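If you're working in the SDK instead, the equivalent target function is a single chat-completion call with this system prompt. A sketch using the openai Python client (the model name is an assumption; substitute any chat model):

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def target(inputs: dict) -> dict:
    """The application under test: one LLM call with the quickstart prompt."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat model works here
        messages=[
            {"role": "system", "content": "Answer the following question accurately:"},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content}
```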

3. Create a dataset

  1. Click Set up Evaluation, which will open a New Experiment table at the bottom of the page.
  2. In the Select or create a new dataset dropdown, click the + New button to create a new dataset.
    Playground with the edited system prompt and new experiment with the dropdown for creating a new dataset.
  3. Add the following examples to the dataset:
    Inputs | Reference Outputs
    question: Which country is Mount Kilimanjaro located in? | output: Mount Kilimanjaro is located in Tanzania.
    question: What is Earth’s lowest point? | output: Earth’s lowest point is the Dead Sea.
  4. Click Save and enter a name to save your newly created dataset.
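The SDK equivalent creates the dataset and its examples programmatically. A sketch, assuming a recent langsmith SDK (the dataset name is a placeholder):

```python
from langsmith import Client

client = Client()

# Create the dataset, then add the two examples from the table above.
dataset = client.create_dataset(
    dataset_name="Quickstart Dataset",  # placeholder -- pick your own name
    description="Questions and reference answers for the evaluation quickstart.",
)
client.create_examples(
    inputs=[
        {"question": "Which country is Mount Kilimanjaro located in?"},
        {"question": "What is Earth's lowest point?"},
    ],
    outputs=[
        {"output": "Mount Kilimanjaro is located in Tanzania."},
        {"output": "Earth's lowest point is the Dead Sea."},
    ],
    dataset_id=dataset.id,
)
```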

4. Add an evaluator

  1. Click + Evaluator and select Correctness from the Pre-built Evaluator options.
  2. In the Correctness panel, click Save.
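The pre-built Correctness evaluator is an LLM-as-judge that grades answers against the reference output. On the SDK side, any function with this signature can serve as an evaluator; below is a deliberately crude string-matching stand-in (not the pre-built judge), shown only to illustrate the interface:

```python
def correctness(outputs: dict, reference_outputs: dict) -> bool:
    """Crude stand-in for the pre-built Correctness evaluator: passes if the
    reference answer (minus its trailing period) appears in the model's answer."""
    reference = reference_outputs["output"].lower().rstrip(".")
    return reference in outputs["answer"].lower()
```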

5. Run your evaluation

  1. Select Start in the top right to run your evaluation. This creates an experiment, previewed in the New Experiment table; you can view it in full by clicking the experiment name.
    Full experiment view of the results that used the example dataset.
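With the SDK, the same experiment is kicked off with a single call, combining the dataset, target function, and evaluator from the sketches above (experiment_prefix is optional and its value here is a placeholder):

```python
results = client.evaluate(
    target,                     # the LLM call defined in step 2
    data="Quickstart Dataset",  # the dataset created in step 3
    evaluators=[correctness],   # the scoring function from step 4
    experiment_prefix="quickstart-eval",  # placeholder experiment name
)
```

The SDK prints a link to the resulting experiment, which opens the same experiment view shown above.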

Next steps

To learn more about running experiments in LangSmith, read the evaluation conceptual guide.
