Some metrics can only be defined at the level of the whole experiment, rather than on its individual runs. For example, you may want to compute the overall pass rate or f1 score of an evaluation target across all the examples in a dataset. These are called summary evaluators.

Basic example

Here, we'll compute the f1 score, which combines precision and recall. This sort of metric can only be computed over all of the examples in our experiment, so our evaluator takes a list of outputs and a list of reference_outputs.
def f1_score_summary_evaluator(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    # Tally true positives, false positives, and false negatives across the whole experiment.
    true_positives = 0
    false_positives = 0
    false_negatives = 0

    for output_dict, reference_output_dict in zip(outputs, reference_outputs):
        output = output_dict["class"]
        reference_output = reference_output_dict["class"]

        if output == "Toxic" and reference_output == "Toxic":
            true_positives += 1
        elif output == "Toxic" and reference_output == "Not toxic":
            false_positives += 1
        elif output == "Not toxic" and reference_output == "Toxic":
            false_negatives += 1

    if true_positives == 0:
        return {"key": "f1_score", "score": 0.0}

    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1_score = 2 * (precision * recall) / (precision + recall)

    return {"key": "f1_score", "score": f1_score}
You can then pass this evaluator to the evaluate method as follows:
from langsmith import Client

ls_client = Client()
dataset = ls_client.clone_public_dataset(
    "https://smith.langchain.com/public/3d6831e6-1680-4c88-94df-618c8e01fc55/d"
)

def bad_classifier(inputs: dict) -> dict:
    # Dummy target that always predicts "Not toxic".
    return {"class": "Not toxic"}

def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Row-level correctness evaluator."""
    return outputs["class"] == reference_outputs["label"]

results = ls_client.evaluate(
    bad_classifier,
    data=dataset,
    evaluators=[correct],
    summary_evaluators=[f1_score_summary_evaluator],
)
In the LangSmith UI, you'll see the summary evaluator's score displayed with the corresponding key.

summary_eval.png

Summary evaluator args

Summary evaluator functions must have specific argument names. They can take any subset of the following arguments (a minimal sketch follows the list):
  • inputs: list[dict]: A list of the inputs for every example in the dataset.
  • outputs: list[dict]: A list of the dict outputs produced by the target on each example's inputs.
  • reference_outputs/referenceOutputs: list[dict]: A list of the reference outputs associated with each example, if available.
  • runs: list[Run]: A list of the full Run objects generated by the experiment, one per example. Use this if you need access to intermediate steps or metadata about each run.
  • examples: list[Example]: All of the dataset Example objects, including the example inputs, outputs (if available), and metadata (if available).
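
For instance, an evaluator is free to declare only the arguments it actually needs. Below is a minimal sketch of a summary evaluator that computes overall accuracy from just outputs and reference_outputs; the "class" and "label" keys are assumptions carried over from the toxicity example above, not a fixed API:

def accuracy_summary_evaluator(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    # Fraction of examples where the predicted class matches the reference label.
    # Assumes outputs use a "class" key and reference outputs use a "label" key.
    correct = sum(
        o["class"] == ro["label"] for o, ro in zip(outputs, reference_outputs)
    )
    return {"key": "accuracy", "score": correct / len(outputs) if outputs else 0.0}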

Summary evaluator output

Summary evaluators are expected to return one of the following types:

Python and JS/TS
  • dict: dicts of the form {"key": ..., "score": ...} allow you to pass a metric name (key) and a numeric or boolean score.

Currently Python only
  • int | float | bool: this is interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
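
For instance, in Python you could return a bare float and let the function name double as the metric name. A minimal sketch, again assuming the "class" and "label" keys from the example above:

def pass_rate(outputs: list[dict], reference_outputs: list[dict]) -> float:
    # Reported as a metric named "pass_rate", after the function itself.
    passed = sum(o["class"] == ro["label"] for o, ro in zip(outputs, reference_outputs))
    return passed / len(outputs) if outputs else 0.0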
