- Programmatically, by specifying evaluators in your code (see this guide for details). A minimal sketch of this approach follows this list.
- By binding evaluators to a dataset in the UI. This will automatically run the evaluators on any new experiments that are created, in addition to any evaluators you set up via the SDK. This is useful when you are iterating on your application (the target function) and have a standard set of evaluators that you want to run for all experiments.
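Below is a minimal sketch of the programmatic approach, assuming the LangSmith Python SDK's evaluate helper; the dataset name, target function, evaluator, and output keys are illustrative placeholders rather than part of this guide.

```python
from langsmith import evaluate  # assumed SDK entry point


def correctness(run, example):
    # Compare the run's output to the example's reference output
    # and return a feedback score (key and field names are illustrative).
    predicted = (run.outputs or {}).get("output")
    expected = (example.outputs or {}).get("output")
    return {"key": "correctness", "score": int(predicted == expected)}


def my_app(inputs: dict) -> dict:
    # Hypothetical target function (your application) under test.
    return {"output": inputs["question"].strip().lower()}


evaluate(
    my_app,
    data="my-dataset",          # hypothetical dataset name
    evaluators=[correctness],   # evaluators applied to every run
    experiment_prefix="sdk-experiment",
)
```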
Configure evaluators on a dataset
- Click the Datasets and Experiments tab in the sidebar.
- Select the dataset you want to configure evaluators for.
- Click the + Evaluator button to add an evaluator to the dataset. This will open a pane that you can use to configure the evaluator.
When you configure an evaluator for a dataset, it will only affect experiment runs created after the evaluator is configured. It will not affect the evaluation of experiment runs created before the evaluator was configured.
LLM-as-a-judge evaluators
The process for binding evaluators to a dataset is very similar to the process for configuring an LLM-as-a-judge evaluator in the Playground. View the instructions for configuring an LLM-as-a-judge evaluator in the Playground.

Custom code evaluators
The process for binding a code evaluator to a dataset is very similar to the process for configuring a code evaluator in online evaluation. View the instructions for configuring code evaluators. The only difference between configuring a code evaluator in online evaluation and binding a code evaluator to a dataset is that the custom code evaluator can reference outputs that are part of the dataset's Example.
For custom code evaluators bound to a dataset, the evaluator function takes in two arguments:
- A Run (reference). This represents the new run in your experiment. For example, if you ran an experiment via the SDK, this would contain the input/output from the chain or model you are testing.
- An Example (reference). This represents the reference example in your dataset that the chain or model you are testing uses. The inputs to the Run and Example should be the same. If your Example has reference outputs, then you can use them to compare to the run's output for scoring.
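As an illustration of this signature, the sketch below shows a custom code evaluator that compares a run's output against the reference outputs on the dataset's Example; the function name, field names, and feedback key are hypothetical and may differ from what the UI editor expects.

```python
def perform_eval(run, example):
    # run.outputs      -> what the chain or model produced for this run
    # example.outputs  -> the reference outputs stored on the dataset Example
    predicted = (run.outputs or {}).get("answer", "")
    reference = (example.outputs or {}).get("answer", "")

    # Score 1 if the prediction exactly matches the reference, else 0.
    return {"exact_match": int(predicted.strip() == reference.strip())}
```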
Next steps
- Analyze your experiment results in the experiments tab
- Compare your experiment results in the comparison view