This guide shows you how to run evaluations of an LLM application using the evaluate() method in the LangSmith SDK.
For larger evaluation jobs in Python, we recommend using aevaluate(), the asynchronous version of evaluate(). It is still worth reading this guide first, since the two share the same interface, before moving on to the how-to guide on running evaluations asynchronously. In JS/TS, evaluate() is already asynchronous, so no separate method is needed. When running large jobs, it is also important to configure the max_concurrency/maxConcurrency argument, which parallelizes evaluation by effectively splitting the dataset across threads.
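For reference, here is a minimal sketch of an aevaluate() call. It assumes the toxicity_classifier target, correct evaluator, and "Toxic Queries" dataset that are defined later in this guide; the async wrapper shown is hypothetical and only illustrates the shape of the call.
import asyncio
from langsmith import aevaluate

# Hypothetical async wrapper around the toxicity_classifier defined later in this guide.
async def atoxicity_classifier(inputs: dict) -> dict:
    return toxicity_classifier(inputs)

async def run_eval():
    results = await aevaluate(
        atoxicity_classifier,
        data="Toxic Queries",   # dataset created later in this guide
        evaluators=[correct],   # evaluator defined later in this guide
        max_concurrency=4,      # evaluate up to 4 examples at a time
    )
    return results

asyncio.run(run_eval())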

Define an application

First we need an application to evaluate. Let's create a simple toxicity classifier for this example.
from langsmith import traceable, wrappers
from openai import OpenAI

# Optionally wrap the OpenAI client to trace all model calls.
oai_client = wrappers.wrap_openai(OpenAI())

# Optionally add the 'traceable' decorator to trace the inputs/outputs of this function.
@traceable
def toxicity_classifier(inputs: dict) -> dict:
    instructions = (
      "Please review the user query below and determine if it contains any form of toxic behavior, "
      "such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does "
      "and 'Not toxic' if it doesn't."
    )
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": inputs["text"]},
    ]
    result = oai_client.chat.completions.create(
        messages=messages, model="gpt-4o-mini", temperature=0
    )
    return {"class": result.choices[0].message.content}
We’ve optionally enabled tracing to capture the inputs and outputs of each step in the pipeline. To learn how to annotate your code for tracing, refer to the tracing guide.

Create or select a dataset

We need a Dataset to evaluate our application on. Our dataset will contain labeled examples of toxic and non-toxic text.
Requires langsmith>=0.3.13
from langsmith import Client
ls_client = Client()

examples = [
  {
    "inputs": {"text": "Shut up, idiot"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "You're a wonderful person"},
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "This is the worst thing ever"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "I had a great day today"},
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "Nobody likes you"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "This is unacceptable. I want to speak to the manager."},
    "outputs": {"label": "Not toxic"},
  },
]

dataset = ls_client.create_dataset(dataset_name="Toxic Queries")
ls_client.create_examples(
  dataset_id=dataset.id,
  examples=examples,
)
For more details on datasets, refer to the Manage datasets page.

Define an evaluator

You can also check out LangChain’s open source evaluation package openevals for common pre-built evaluators.
Evaluators are functions for scoring your application’s outputs. They take in the example inputs, actual outputs, and, when present, the reference outputs. Since we have labels for this task, our evaluator can directly check if the actual outputs match the reference outputs.
  • Python: Requires langsmith>=0.3.13
  • TypeScript: Requires langsmith>=0.2.9
def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["class"] == reference_outputs["label"]
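A boolean return like this is recorded as a pass/fail score. Evaluators can also return richer feedback; as an illustrative sketch (not used elsewhere in this guide), an evaluator can return a dict containing a feedback key, a score, and a comment:
def correct_with_comment(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    # Same check as above, but attaches a named feedback key and an explanation.
    score = outputs["class"] == reference_outputs["label"]
    return {
        "key": "correctness",
        "score": score,
        "comment": f"Predicted {outputs['class']!r}, expected {reference_outputs['label']!r}",
    }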

Run the evaluation

We’ll use the evaluate() / aevaluate() methods to run the evaluation. The key arguments are:
  • a target function that takes an input dictionary and returns an output dictionary. The example.inputs field of each Example is what gets passed to the target function. In this case our toxicity_classifier is already set up to take in example inputs so we can use it directly.
  • data - the name OR UUID of the LangSmith dataset to evaluate on, or an iterator of examples (see the sketch after the example below)
  • evaluators - a list of evaluators to score the outputs of the function
Python: Requires langsmith>=0.3.13
# Can equivalently use the 'evaluate' function directly:
# from langsmith import evaluate; evaluate(...)
results = ls_client.evaluate(
    toxicity_classifier,
    data=dataset.name,
    evaluators=[correct],
    experiment_prefix="gpt-4o-mini, baseline",  # optional, experiment name prefix
    description="Testing the baseline system.",  # optional, experiment description
    max_concurrency=4, # optional, add concurrency
)
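The data argument also accepts an iterator of examples, which is useful for evaluating on a subset of a dataset. A minimal sketch reusing the client and dataset from above (taking a subset via the limit parameter is just one option):
# Evaluate on a few examples instead of the full dataset.
subset = ls_client.list_examples(dataset_name=dataset.name, limit=3)

results = ls_client.evaluate(
    toxicity_classifier,
    data=subset,
    evaluators=[correct],
    experiment_prefix="gpt-4o-mini, subset",  # optional
)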

Explore the results

Each invocation of evaluate() creates an Experiment which can be viewed in the LangSmith UI or queried via the SDK. Evaluation scores are stored against each actual output as feedback. If you’ve annotated your code for tracing, you can open the trace of each row in a side panel view.
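To inspect the results from the SDK rather than the UI, the results object returned by evaluate() can be converted to a DataFrame, for example (requires pandas):
# Summarize the experiment programmatically.
df = results.to_pandas()
print(df.head())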

Reference code

from langsmith import Client, traceable, wrappers
from openai import OpenAI

# Step 1. Define an application
oai_client = wrappers.wrap_openai(OpenAI())

@traceable
def toxicity_classifier(inputs: dict) -> dict:
    system = (
      "Please review the user query below and determine if it contains any form of toxic behavior, "
      "such as insults, threats, or highly negative comments. Respond with 'Toxic' if it does "
      "and 'Not toxic' if it doesn't."
    )
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": inputs["text"]},
    ]
    result = oai_client.chat.completions.create(
        messages=messages, model="gpt-4o-mini", temperature=0
    )
    return {"class": result.choices[0].message.content}

# Step 2. Create a dataset
ls_client = Client()
dataset = ls_client.create_dataset(dataset_name="Toxic Queries")
examples = [
  {
    "inputs": {"text": "Shut up, idiot"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "You're a wonderful person"},
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "This is the worst thing ever"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "I had a great day today"},
    "outputs": {"label": "Not toxic"},
  },
  {
    "inputs": {"text": "Nobody likes you"},
    "outputs": {"label": "Toxic"},
  },
  {
    "inputs": {"text": "This is unacceptable. I want to speak to the manager."},
    "outputs": {"label": "Not toxic"},
  },
]
ls_client.create_examples(
  dataset_id=dataset.id,
  examples=examples,
)

# Step 3. Define an evaluator
def correct(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["class"] == reference_outputs["label"]

# Step 4. Run the evaluation
# Client.evaluate() and evaluate() behave the same.
results = ls_client.evaluate(
    toxicity_classifier,
    data=dataset.name,
    evaluators=[correct],
    experiment_prefix="gpt-4o-mini, simple",  # optional, experiment name prefix
    description="Testing the baseline system.",  # optional, experiment description
    max_concurrency=4,  # optional, add concurrency
)
