- Programmatically, by specifying evaluators in your code (see this guide for details). A minimal sketch of this approach follows this list.
- By binding evaluators to a dataset in the UI. This will automatically run the evaluators on any new experiments that are created, in addition to any evaluators you set up via the SDK. This is useful when you are iterating on your application (the target function) and have a standard set of evaluators that you want to run for all experiments.
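Below is a minimal sketch of the programmatic approach, assuming the LangSmith Python SDK's evaluate helper; the dataset name, target function, evaluator, and output keys are illustrative placeholders rather than part of this guide.

```python
from langsmith import evaluate  # assumed SDK entry point


def correctness(run, example):
    # Compare the run's output to the example's reference output
    # and return a feedback score (key and field names are illustrative).
    predicted = (run.outputs or {}).get("output")
    expected = (example.outputs or {}).get("output")
    return {"key": "correctness", "score": int(predicted == expected)}


def my_app(inputs: dict) -> dict:
    # Hypothetical target function (your application) under test.
    return {"output": inputs["question"].strip().lower()}


evaluate(
    my_app,
    data="my-dataset",          # hypothetical dataset name
    evaluators=[correctness],   # evaluators applied to every run
    experiment_prefix="sdk-experiment",
)
```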
Configure evaluators on a dataset
- Click the Datasets and Experiments tab in the sidebar.
- Select the dataset you want to configure evaluators for.
- Click the + Evaluator button to add an evaluator to the dataset. This will open a pane that you can use to configure the evaluator.
When you configure an evaluator for a dataset, it will only affect experiment runs created after the evaluator is configured. It will not affect the evaluation of experiment runs created before the evaluator was configured.
LLM-as-a-judge evaluators
The process for binding evaluators to a dataset is very similar to the process for configuring an LLM-as-a-judge evaluator in the Playground. View the instructions for configuring an LLM-as-a-judge evaluator in the Playground.

Custom code evaluators
The process for binding a code evaluator to a dataset is very similar to the process for configuring a code evaluator in online evaluation. View the instructions for configuring code evaluators. The only difference between configuring a code evaluator in online evaluation and binding a code evaluator to a dataset is that the custom code evaluator can reference outputs that are part of the dataset's Example.
For custom code evaluators bound to a dataset, the evaluator function takes in two arguments:
- A Run (reference). This represents the new run in your experiment. For example, if you ran an experiment via the SDK, this would contain the input/output from the chain or model you are testing.
- An Example (reference). This represents the reference example in your dataset that the chain or model you are testing uses. The inputs to the Run and Example should be the same. If your Example has reference outputs, then you can use them to compare to the run's output for scoring.
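As an illustration of this signature, the sketch below shows a custom code evaluator that compares a run's output against the reference outputs on the dataset's Example; the function name, field names, and feedback key are hypothetical and may differ from what the UI editor expects.

```python
def perform_eval(run, example):
    # run.outputs      -> what the chain or model produced for this run
    # example.outputs  -> the reference outputs stored on the dataset Example
    predicted = (run.outputs or {}).get("answer", "")
    reference = (example.outputs or {}).get("answer", "")

    # Score 1 if the prediction exactly matches the reference, else 0.
    return {"exact_match": int(predicted.strip() == reference.strip())}
```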
Next steps
- Analyze your experiment results in the experiments tab
- Compare your experiment results in the comparison view