Agentic applications let an LLM decide its own next steps toward solving a problem. That flexibility is powerful, but the black-box nature of models makes it hard to predict how adjusting one part of an agent will affect the rest. Building production-ready agents requires thorough testing. There are a few approaches to testing agents:
  • Unit tests exercise small, deterministic parts of your agent in isolation using in-memory fakes, so you can assert exact behavior quickly and deterministically.
  • Integration tests exercise the agent with real network calls to confirm that components work together, that credentials and schemas line up, and that latency is acceptable.
Agent applications tend to rely more heavily on integration tests because they chain multiple components together and must cope with the flakiness introduced by the non-deterministic nature of LLMs.

Unit tests

Mocking the chat model

For logic that doesn't require API calls, you can mock responses with an in-memory stub. LangChain provides GenericFakeChatModel for mocking text responses. It accepts an iterator of responses (AIMessages or strings) and returns the next one on each invocation. It supports both regular and streaming usage.
from langchain_core.language_models.fake_chat_models import GenericFakeChatModel
from langchain_core.messages import AIMessage, ToolCall

model = GenericFakeChatModel(messages=iter([
    AIMessage(content="", tool_calls=[ToolCall(name="foo", args={"bar": "baz"}, id="call_1")]),
    "bar"
]))

model.invoke("hello")
# AIMessage(content='', ..., tool_calls=[{'name': 'foo', 'args': {'bar': 'baz'}, 'id': 'call_1', 'type': 'tool_call'}])
If we invoke the model again, it returns the next item from the iterator:
model.invoke("hello, again!")
# AIMessage(content='bar', ...)
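
As noted above, the fake model also supports streaming. A minimal sketch (the exact chunking of the streamed content may differ):
# String content is streamed back as chunks, roughly split on whitespace
stream_model = GenericFakeChatModel(messages=iter([AIMessage(content="hello world")]))

for chunk in stream_model.stream("hi"):
    print(repr(chunk.content))
# e.g. 'hello', ' ', 'world'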

The InMemorySaver checkpointer

To enable persistence during tests, you can use the InMemorySaver checkpointer. This lets you simulate multiple turns to test state-dependent behavior:
from langchain.agents import create_agent
from langchain.messages import HumanMessage
from langgraph.checkpoint.memory import InMemorySaver

agent = create_agent(
    model,
    tools=[],
    checkpointer=InMemorySaver()
)

# Use the same thread_id so both invocations share the checkpointed conversation state
config = {"configurable": {"thread_id": "test-thread"}}

# First invocation
agent.invoke({"messages": [HumanMessage(content="I live in Sydney, Australia.")]}, config)

# Second invocation: the first message is persisted (Sydney location), so the model can return GMT+10 time
agent.invoke({"messages": [HumanMessage(content="What's my local time?")]}, config)
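
In a test, you can also assert on the persisted state directly. A minimal sketch using LangGraph's get_state with the config above (what the messages contain depends on the model's replies):
# Inspect the checkpointed thread to verify that earlier turns were persisted
state = agent.get_state(config)
messages = state.values["messages"]
assert any("Sydney" in str(m.content) for m in messages)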

Integration tests

Many agent behaviors only emerge with a real LLM: which tool the agent decides to call, how it formats its response, or whether a prompt change affects the entire execution trajectory. LangChain's agentevals package provides evaluators designed specifically for testing agent trajectories with live models. AgentEvals lets you easily evaluate your agent's trajectory (the exact sequence of messages, including tool calls) either by trajectory matching or with an LLM judge:

Trajectory matching

Hard-code a reference trajectory for a given input and validate a run by comparing it step by step. Ideal for testing well-defined workflows where you know the expected behavior. Use this when you have concrete expectations about which tools should be called and in what order. This approach is deterministic, fast, and cost-effective since it requires no additional LLM calls.

LLM-as-judge

Use an LLM to qualitatively validate your agent's execution trajectory. The "judge" LLM reviews the agent's decisions against criteria in a prompt (which can include a reference trajectory). More flexible and able to assess nuanced aspects such as efficiency and appropriateness, but it requires LLM calls and is less deterministic. Use this when you want to evaluate the overall quality and reasonableness of the trajectory without strict requirements on tool calls or their order.

Installing AgentEvals

pip install agentevals
Alternatively, clone the AgentEvals repository directly.

Trajectory match evaluator

AgentEvals provides the create_trajectory_match_evaluator function to match your agent's trajectory against a reference trajectory. There are four modes to choose from:
Mode      | Description                                                        | Use case
strict    | Exact match of the same messages and tool calls in the same order | Testing a specific sequence (e.g., a policy lookup before authorization)
unordered | Allows the same tool calls in any order                            | Verifying that information was retrieved when order doesn't matter
subset    | Agent calls only tools from the reference (no extras)              | Ensuring the agent stays within the expected scope
superset  | Agent calls at least the reference tools (extras allowed)          | Verifying that the minimum required actions were taken
The strict mode ensures the trajectory contains identical messages and identical tool calls in the same order, though it does allow differences in message content. This is useful when you need to enforce a specific sequence of operations, such as requiring a policy lookup before an authorized action.
from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

agent = create_agent("gpt-4o", tools=[get_weather])

evaluator = create_trajectory_match_evaluator(  
    trajectory_match_mode="strict",  
)  

def test_weather_tool_called_strict():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in San Francisco?")]
    })

    reference_trajectory = [
        HumanMessage(content="What's the weather in San Francisco?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "San Francisco"}}
        ]),
        ToolMessage(content="It's 75 degrees and sunny in San Francisco.", tool_call_id="call_1"),
        AIMessage(content="The weather in San Francisco is 75 degrees and sunny."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory
    )
    # {
    #     'key': 'trajectory_strict_match',
    #     'score': True,
    #     'comment': None,
    # }
    assert evaluation["score"] is True
The unordered mode allows the same tool calls in any order, which is helpful when you want to verify that specific information was retrieved but don’t care about the sequence. For example, an agent might need to check both weather and events for a city, but the order doesn’t matter.
from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

@tool
def get_events(city: str):
    """Get events happening in a city."""
    return f"Concert at the park in {city} tonight."

agent = create_agent("gpt-4o", tools=[get_weather, get_events])

evaluator = create_trajectory_match_evaluator(  
    trajectory_match_mode="unordered",  
)  

def test_multiple_tools_any_order():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's happening in SF today?")]
    })

    # Reference shows tools called in different order than actual execution
    reference_trajectory = [
        HumanMessage(content="What's happening in SF today?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_events", "args": {"city": "SF"}},
            {"id": "call_2", "name": "get_weather", "args": {"city": "SF"}},
        ]),
        ToolMessage(content="Concert at the park in SF tonight.", tool_call_id="call_1"),
        ToolMessage(content="It's 75 degrees and sunny in SF.", tool_call_id="call_2"),
        AIMessage(content="Today in SF: 75 degrees and sunny with a concert at the park tonight."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    # {
    #     'key': 'trajectory_unordered_match',
    #     'score': True,
    # }
    assert evaluation["score"] is True
The superset and subset modes match partial trajectories. The superset mode verifies that the agent called at least the tools in the reference trajectory, allowing additional tool calls. The subset mode ensures the agent did not call any tools beyond those in the reference.
from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.match import create_trajectory_match_evaluator


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

@tool
def get_detailed_forecast(city: str):
    """Get detailed weather forecast for a city."""
    return f"Detailed forecast for {city}: sunny all week."

agent = create_agent("gpt-4o", tools=[get_weather, get_detailed_forecast])

evaluator = create_trajectory_match_evaluator(  
    trajectory_match_mode="superset",  
)  

def test_agent_calls_required_tools_plus_extra():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in Boston?")]
    })

    # Reference only requires get_weather, but agent may call additional tools
    reference_trajectory = [
        HumanMessage(content="What's the weather in Boston?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "Boston"}},
        ]),
        ToolMessage(content="It's 75 degrees and sunny in Boston.", tool_call_id="call_1"),
        AIMessage(content="The weather in Boston is 75 degrees and sunny."),
    ]

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    # {
    #     'key': 'trajectory_superset_match',
    #     'score': True,
    #     'comment': None,
    # }
    assert evaluation["score"] is True
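For completeness, the subset mode (not shown in a full example above) is configured the same way; it passes only if the agent called no tools beyond those in the reference trajectory:
subset_evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="subset",
)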
You can also set the tool_args_match_mode property and/or tool_args_match_overrides to customize how the evaluator considers equality between tool calls in the actual trajectory vs. the reference. By default, only tool calls with the same arguments to the same tool are considered equal. Visit the repository for more details.
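A sketch of customizing tool-call comparison, based on the options named above (check the agentevals repository for the exact set of accepted values and the comparator signature):
custom_args_evaluator = create_trajectory_match_evaluator(
    trajectory_match_mode="unordered",
    # Ignore tool-call arguments entirely when comparing trajectories
    tool_args_match_mode="ignore",
    # Or override the comparison per tool with a custom matcher (illustrative)
    # tool_args_match_overrides={
    #     "get_weather": lambda actual, reference: actual["city"].lower() == reference["city"].lower(),
    # },
)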

LLM-as-judge evaluator

You can also use an LLM to evaluate the agent’s execution path with the create_trajectory_llm_as_judge function. Unlike the trajectory match evaluators, it doesn’t require a reference trajectory, but one can be provided if available.
from langchain.agents import create_agent
from langchain.tools import tool
from langchain.messages import HumanMessage, AIMessage, ToolMessage
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT


@tool
def get_weather(city: str):
    """Get weather information for a city."""
    return f"It's 75 degrees and sunny in {city}."

agent = create_agent("gpt-4o", tools=[get_weather])

evaluator = create_trajectory_llm_as_judge(  
    model="openai:o3-mini",  
    prompt=TRAJECTORY_ACCURACY_PROMPT,  
)  

def test_trajectory_quality():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in Seattle?")]
    })

    evaluation = evaluator(
        outputs=result["messages"],
    )
    # {
    #     'key': 'trajectory_accuracy',
    #     'score': True,
    #     'comment': 'The provided agent trajectory is reasonable...'
    # }
    assert evaluation["score"] is True
If you have a reference trajectory, you can add an extra variable to your prompt and pass in the reference trajectory. Below, we use the prebuilt TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE prompt and configure the reference_outputs variable:
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE

judge_with_reference = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
)
evaluation = judge_with_reference(
    outputs=result["messages"],
    reference_outputs=reference_trajectory,
)
For more configurability over how the LLM evaluates the trajectory, visit the repository.

Async support

All agentevals evaluators support Python asyncio. For evaluators created via factory functions, an async version is available by adding async after create_ in the function name.
from agentevals.trajectory.llm import create_async_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT
from agentevals.trajectory.match import create_async_trajectory_match_evaluator

async_judge = create_async_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

async_evaluator = create_async_trajectory_match_evaluator(
    trajectory_match_mode="strict",
)

async def test_async_evaluation():
    result = await agent.ainvoke({
        "messages": [HumanMessage(content="What's the weather?")]
    })

    evaluation = await async_judge(outputs=result["messages"])
    assert evaluation["score"] is True

LangSmith integration

To track experiments over time, you can log evaluator results to LangSmith, a platform for building production-grade LLM applications that includes tracing, evaluation, and experimentation tooling. First, set up LangSmith by setting the required environment variables:
export LANGSMITH_API_KEY="your_langsmith_api_key"
export LANGSMITH_TRACING="true"
LangSmith offers two main approaches for running evaluations: the pytest integration and the evaluate function. With the pytest integration, decorate your test with @pytest.mark.langsmith and log inputs, outputs, and reference outputs:
import pytest
from langsmith import testing as t
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT

trajectory_evaluator = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

@pytest.mark.langsmith
def test_trajectory_accuracy():
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in SF?")]
    })

    reference_trajectory = [
        HumanMessage(content="What's the weather in SF?"),
        AIMessage(content="", tool_calls=[
            {"id": "call_1", "name": "get_weather", "args": {"city": "SF"}},
        ]),
        ToolMessage(content="It's 75 degrees and sunny in SF.", tool_call_id="call_1"),
        AIMessage(content="The weather in SF is 75 degrees and sunny."),
    ]

    # Log inputs, outputs, and reference outputs to LangSmith
    t.log_inputs({})
    t.log_outputs({"messages": result["messages"]})
    t.log_reference_outputs({"messages": reference_trajectory})

    trajectory_evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory
    )
Run the evaluation with pytest:
pytest test_trajectory.py --langsmith-output
Results will be automatically logged to LangSmith.
Alternatively, you can create a dataset in LangSmith and use the evaluate function:
from langsmith import Client
from agentevals.trajectory.llm import create_trajectory_llm_as_judge, TRAJECTORY_ACCURACY_PROMPT

client = Client()

trajectory_evaluator = create_trajectory_llm_as_judge(
    model="openai:o3-mini",
    prompt=TRAJECTORY_ACCURACY_PROMPT,
)

def run_agent(inputs):
    """Your agent function that returns trajectory messages."""
    return agent.invoke(inputs)["messages"]

experiment_results = client.evaluate(
    run_agent,
    data="your_dataset_name",
    evaluators=[trajectory_evaluator]
)
Results will be automatically logged to LangSmith.
To learn more about evaluating your agent, see the LangSmith docs.

Recording and replaying HTTP calls

Integration tests that call real LLM APIs can be slow and expensive, especially when run frequently in CI/CD pipelines. We recommend using a library that records HTTP requests and responses and then replays them on subsequent runs without making real network calls. You can use vcrpy for this. If you use pytest, the pytest-recording plugin offers a simple way to enable this with minimal configuration. Requests and responses are recorded in cassettes, which are then used on later runs to mock the real network calls. Set up your conftest.py file to filter sensitive information out of the cassettes:
conftest.py
import pytest

@pytest.fixture(scope="session")
def vcr_config():
    return {
        "filter_headers": [
            ("authorization", "XXXX"),
            ("x-api-key", "XXXX"),
            # ... other headers you want to mask
        ],
        "filter_query_parameters": [
            ("api_key", "XXXX"),
            ("key", "XXXX"),
        ],
    }
Then configure pytest (e.g., in pytest.ini) to recognize the vcr marker:
[pytest]
markers =
    vcr: record/replay HTTP via VCR
addopts = --record-mode=once
The --record-mode=once option records HTTP interactions on the first run and replays them on subsequent runs.
Now, simply decorate your tests with the vcr marker:
@pytest.mark.vcr()
def test_agent_trajectory():
    # ...
The first time you run this test, your agent will make real network calls and pytest will generate a cassette file test_agent_trajectory.yaml in the tests/cassettes directory. Subsequent runs will use that cassette to mock the real network calls, provided the agent's requests don't change from the previous run. If they do change, the test will fail and you'll need to delete the cassette and rerun the test to record fresh interactions.
When you modify prompts, add new tools, or change expected trajectories, your saved cassettes will become outdated and your existing tests will fail. You should delete the corresponding cassette files and rerun the tests to record fresh interactions.
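
Putting the pieces together, a recorded integration test might look like the following sketch. It assumes the agent, strict-match evaluator, and reference_trajectory from the earlier examples are in scope; the first run hits the real API and writes a cassette, and later runs replay it:
import pytest
from langchain.messages import HumanMessage

@pytest.mark.vcr()
def test_agent_trajectory_recorded():
    # Replayed from the cassette after the first (recorded) run
    result = agent.invoke({
        "messages": [HumanMessage(content="What's the weather in San Francisco?")]
    })

    evaluation = evaluator(
        outputs=result["messages"],
        reference_outputs=reference_trajectory,
    )
    assert evaluation["score"] is True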
