Evaluations are a quantitative way to measure the performance of your LLM application. LLM behavior can be unpredictable: even small changes to a prompt, model, or input can significantly affect results. Evaluations give you a structured way to identify failures, compare versions, and build more reliable AI applications.

Running an evaluation in LangSmith requires three key components:
  • Dataset: A collection of test inputs (and, optionally, expected outputs).
  • Target function: The part of your application you want to test. This could be a single LLM call with a new prompt, one module, or your entire workflow.
  • Evaluators: Functions that score the outputs of your target function.
This quickstart walks you through running a starter evaluation that checks the correctness of LLM responses, using the LangSmith SDK or the UI.
If you'd rather get started by video, see the datasets and evaluations video guide.
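In SDK terms, the three components come together in a single evaluate() call. The sketch below shows only that shape, assuming a recent langsmith Python SDK (where Client.evaluate is available) and a placeholder dataset name; the steps in this guide build each piece out concretely:

```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

def target(inputs: dict) -> dict:
    # 2) The target function under test -- here just a stub answer.
    return {"answer": "stub answer"}

def exact_match(outputs: dict, reference_outputs: dict) -> bool:
    # 3) An evaluator scores each output against the reference output.
    return outputs["answer"] == reference_outputs["output"]

client.evaluate(
    target,                    # the target function
    data="my-dataset",         # 1) a dataset (placeholder name)
    evaluators=[exact_match],  # one or more evaluators
)
```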

Prerequisites

Before you begin, make sure you have a LangSmith account and an OpenAI API key. This guide follows the UI workflow; equivalent SDK sketches are included along the way.
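If you plan to follow along with the SDK, a typical local setup looks like the sketch below (install the packages first, e.g. pip install -U langsmith openai; all key values are placeholders):

```python
import os

# Placeholders -- substitute your real keys before running.
os.environ["LANGSMITH_API_KEY"] = "<your-langsmith-api-key>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
os.environ["LANGSMITH_TRACING"] = "true"  # optional: also trace the runs
```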

1. Set workspace secrets

In the LangSmith UI, make sure your OpenAI API key is set as a workspace secret:
  1. Navigate to Settings, then move to the Secrets tab.
  2. Select Add secret and enter OPENAI_API_KEY as the key, with your API key as the Value.
  3. Select Save secret.
When adding workspace secrets in the LangSmith UI, make sure the secret key matches the environment variable name the model provider expects.

2. Create a prompt

LangSmith’s Prompt Playground makes it possible to run evaluations over different prompts and new models, or to test different model configurations.
  1. In the LangSmith UI, navigate to the Playground under Prompt Engineering.
  2. Under the Prompts panel, modify the system prompt to:
    Answer the following question accurately:
    
    Leave the Human message as is: {question}.
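If you're working in the SDK instead, the equivalent target function is a single chat-completion call with this system prompt. A sketch using the openai Python client (the model name is an assumption; substitute any chat model):

```python
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def target(inputs: dict) -> dict:
    """The application under test: one LLM call with the quickstart prompt."""
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any chat model works here
        messages=[
            {"role": "system", "content": "Answer the following question accurately:"},
            {"role": "user", "content": inputs["question"]},
        ],
    )
    return {"answer": response.choices[0].message.content}
```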

3. Create a dataset

  1. Click Set up Evaluation, which will open a New Experiment table at the bottom of the page.
  2. In the Select or create a new dataset dropdown, click the + New button to create a new dataset.
    Playground with the edited system prompt and new experiment with the dropdown for creating a new dataset.
  3. Add the following examples to the dataset:
    Inputs | Reference Outputs
    question: Which country is Mount Kilimanjaro located in? | output: Mount Kilimanjaro is located in Tanzania.
    question: What is Earth’s lowest point? | output: Earth’s lowest point is the Dead Sea.
  4. Click Save and enter a name to save your newly created dataset.
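The SDK equivalent creates the dataset and its examples programmatically. A sketch, assuming a recent langsmith SDK (the dataset name is a placeholder):

```python
from langsmith import Client

client = Client()

# Create the dataset, then add the two examples from the table above.
dataset = client.create_dataset(
    dataset_name="Quickstart Dataset",  # placeholder -- pick your own name
    description="Questions and reference answers for the evaluation quickstart.",
)
client.create_examples(
    inputs=[
        {"question": "Which country is Mount Kilimanjaro located in?"},
        {"question": "What is Earth's lowest point?"},
    ],
    outputs=[
        {"output": "Mount Kilimanjaro is located in Tanzania."},
        {"output": "Earth's lowest point is the Dead Sea."},
    ],
    dataset_id=dataset.id,
)
```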

4. Add an evaluator

  1. Click + Evaluator and select Correctness from the Pre-built Evaluator options.
  2. In the Correctness panel, click Save.
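The pre-built Correctness evaluator is an LLM-as-judge that grades answers against the reference output. On the SDK side, any function with this signature can serve as an evaluator; below is a deliberately crude string-matching stand-in (not the pre-built judge), shown only to illustrate the interface:

```python
def correctness(outputs: dict, reference_outputs: dict) -> bool:
    """Crude stand-in for the pre-built Correctness evaluator: passes if the
    reference answer (minus its trailing period) appears in the model's answer."""
    reference = reference_outputs["output"].lower().rstrip(".")
    return reference in outputs["answer"].lower()
```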

5. Run your evaluation

  1. Select Start in the top right to run your evaluation. This creates an experiment, previewed in the New Experiment table; you can view it in full by clicking the experiment name.
    Full experiment view of the results that used the example dataset.
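With the SDK, the same experiment is kicked off with a single call, combining the dataset, target function, and evaluator from the sketches above (experiment_prefix is optional and its value here is a placeholder):

```python
results = client.evaluate(
    target,                     # the LLM call defined in step 2
    data="Quickstart Dataset",  # the dataset created in step 3
    evaluators=[correctness],   # the scoring function from step 4
    experiment_prefix="quickstart-eval",  # placeholder experiment name
)
```

The SDK prints a link to the resulting experiment, which opens the same experiment view shown above.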

Next steps

To learn more about running experiments in LangSmith, read the evaluation conceptual guide.
