How to evaluate a Runnable


langchain: Python and JS/TS
Runnable: Python and JS/TS

langchain Runnable objects (such as chat models, retrievers, and chains) can be passed directly to evaluate() / aevaluate().

Setup

Let's define a simple chain to evaluate. First, install the required packages:

pip install -U langsmith "langchain[openai]"

Now define a chain:

from langchain.chat_models import init_chat_model
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

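instructions = (
    "Please review the user query below and determine if it contains any form of "
    "toxic behavior, such as insults, threats, or highly negative comments. "
    "Respond with 'Toxic' if it does, and 'Not toxic' if it doesn't."
)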

prompt = ChatPromptTemplate(
    [("system", instructions), ("user", "{text}")],
)

model = init_chat_model("gpt-4o")
chain = prompt | model | StrOutputParser()

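Evaluation

To evaluate our chain, we can pass it directly to the evaluate() / aevaluate() method. Note that the chain's input variables must match the keys of the example inputs. In this case, the example inputs should have the form {"text": "..."}.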
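The key-matching requirement can be checked before running an experiment. The following is a minimal sketch using only the standard library: it extracts the placeholder names from a format-style template and reports any that the example inputs do not supply. The helper names `template_variables` and `missing_keys` and the sample inputs are hypothetical, introduced here only for illustration; with a real prompt object, the `input_variables` attribute of `ChatPromptTemplate` gives the expected names.

```python
from string import Formatter


def template_variables(template: str) -> set[str]:
    """Return the placeholder names used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_keys(template: str, example_inputs: dict) -> set[str]:
    """Return template variables that the example inputs do not provide."""
    return template_variables(template) - set(example_inputs)


# The user message template from the chain defined in the setup section.
user_template = "{text}"

# Hypothetical example inputs, used only for illustration.
good_example = {"text": "You are a wonderful person."}
bad_example = {"query": "You are a wonderful person."}

print(missing_keys(user_template, good_example))  # set()
print(missing_keys(user_template, bad_example))   # {'text'}
```

An empty set means every variable the prompt expects is present in the example input, so the chain can be invoked on that example as-is.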

from langsmith import aevaluate, Client

client = Client()

# Clone a dataset of texts with toxicity labels.
# Each example input has a "text" key and each output has a "label" key.
dataset = client.clone_public_dataset(
    "https://smith.langchain.com/public/3d6831e6-1680-4c88-94df-618c8e01fc55/d"
)

def correct(outputs: dict, reference_outputs: dict) -> bool:
    # Since our chain outputs a string not a dict, this string
    # gets stored under the default "output" key in the outputs dict:
    actual = outputs["output"]
    expected = reference_outputs["label"]
    return actual == expected

results = await aevaluate(
    chain,
    data=dataset,
    evaluators=[correct],
    experiment_prefix="gpt-4o, baseline",
)
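Because an evaluator is a plain function, it can be exercised on hand-built dictionaries before any experiment is launched. The sketch below assumes the shapes described in the comments above: the chain's string result stored under "output" and the reference label stored under "label". The `correct_normalized` variant and the sample values are hypothetical additions, shown as one possible way to tolerate differences in case, whitespace, or a trailing period in the model's reply.

```python
def correct(outputs: dict, reference_outputs: dict) -> bool:
    """Exact-match evaluator, as defined above."""
    return outputs["output"] == reference_outputs["label"]


def _normalize(text: str) -> str:
    """Trim whitespace and surrounding periods, then ignore case."""
    return text.strip().strip(".").casefold()


def correct_normalized(outputs: dict, reference_outputs: dict) -> bool:
    """Hypothetical variant: compare after light normalization."""
    return _normalize(outputs["output"]) == _normalize(reference_outputs["label"])


# Hand-built stand-ins for what the evaluator would receive.
reference = {"label": "Toxic"}

print(correct({"output": "Toxic"}, reference))                  # True
print(correct({"output": "toxic."}, reference))                 # False
print(correct_normalized({"output": " toxic.\n"}, reference))   # True
print(correct_normalized({"output": "Not toxic"}, reference))   # False
```

Whether normalization is appropriate depends on how strictly the model should be held to the requested output format; the exact-match version penalizes any deviation.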

The runnable is traced appropriately for each output.

Related

How to evaluate a langgraph graph
