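Agentic applications let an LLM decide its own next steps to solve a problem. That flexibility is powerful, but the model's black-box nature makes it hard to predict how a tweak to one part of your agent will affect the rest. To build production-ready agents, thorough testing is essential.
There are a few approaches to testing your agents:
  • Unit tests exercise small, deterministic pieces of your agent in isolation using in-memory fakes, so you can assert exact behavior quickly and repeatably.
  • Integration tests run the agent with real network calls to confirm that components work together, credentials and schemas line up, and latency is acceptable.
Agentic applications tend to lean more on integration tests because they chain multiple components together and must deal with flakiness caused by the nondeterministic nature of LLMs.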
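As a rough illustration of the unit-test style described above, the sketch below swaps the model for an in-memory fake that replays scripted replies. Every name here (`fakeModel`, `runStep`, the `Model` and `ToolFn` types) is hypothetical and defined only for this example; none of them are LangChain APIs.

```typescript
// Hypothetical minimal types for this sketch; these are NOT LangChain types.
type ToolArgs = Record<string, unknown>;
type ToolCall = { name: string; args: ToolArgs };
type ModelReply = { content: string; toolCalls: ToolCall[] };
type Model = (prompt: string) => Promise<ModelReply>;
type ToolFn = (args: ToolArgs) => string;

// In-memory fake: replays scripted replies in order, with no network calls.
function fakeModel(replies: ModelReply[]): Model {
  const queue = replies.slice();
  return async (_prompt: string) => {
    const next = queue.shift();
    if (next === undefined) throw new Error("fake model ran out of replies");
    return next;
  };
}

// The deterministic piece under test: ask the model once, then run
// whichever tools it requested and collect their outputs.
async function runStep(
  model: Model,
  tools: Record<string, ToolFn | undefined>,
  prompt: string
): Promise<string[]> {
  const reply = await model(prompt);
  return reply.toolCalls.map((call) => {
    const fn = tools[call.name];
    if (fn === undefined) throw new Error(`unknown tool: ${call.name}`);
    return fn(call.args);
  });
}

const fakeTools: Record<string, ToolFn | undefined> = {
  get_weather: (args) => `It's 75 degrees and sunny in ${String(args["city"])}.`,
};

// Script the "model" so that it always requests the weather tool.
const scriptedModel = fakeModel([
  { content: "", toolCalls: [{ name: "get_weather", args: { city: "SF" } }] },
]);

const stepResult: Promise<string[]> = runStep(
  scriptedModel,
  fakeTools,
  "What's the weather in SF?"
);
```

Because the fake returns the same reply on every run, a test can assert on the exact tool output without worrying about model variance.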

Integration testing

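Many agent behaviors only emerge when using a real LLM, such as which tool the agent decides to call, how it formats responses, or whether a prompt change affects the entire execution trajectory. LangChain's agentevals package provides evaluators designed specifically for testing agent trajectories with live models.
AgentEvals lets you evaluate the trajectory of your agent (the exact sequence of messages, including tool calls) by performing a trajectory match or by using an LLM judge: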

Trajectory match

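Hard-code a reference trajectory for a given input and validate the run with a step-by-step comparison. Ideal for testing well-defined workflows where you know the expected behavior. Use it when you have specific expectations about which tools should be called and in what order. The comparison itself is deterministic, fast, and cost-effective because it requires no additional LLM calls.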

LLM-as-judge

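Use an LLM to qualitatively validate your agent's execution trajectory. The "judge" LLM reviews the agent's decisions against a prompt rubric, which can include a reference trajectory. This is more flexible and can assess nuanced aspects such as efficiency and appropriateness, but it requires an LLM call and is less deterministic. Use it when you want to evaluate the overall quality and reasonableness of the agent's trajectory without strict tool-call or ordering requirements.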

Installing AgentEvals

npm install agentevals @langchain/core
Alternatively, clone the AgentEvals repository directly.

Trajectory match evaluator

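AgentEvals offers the createTrajectoryMatchEvaluator function to match your agent's trajectory against a reference trajectory. There are four modes to choose from:

| Mode | Description | Use case |
| strict | Exact match of messages and tool calls in the same order | Testing specific sequences (e.g., policy lookup before authorization) |
| unordered | Same tool calls allowed in any order | Verifying information retrieval when order doesn't matter |
| subset | Agent calls only tools from the reference (no extras) | Ensuring the agent doesn't exceed the expected scope |
| superset | Agent calls at least the reference tools (extras allowed) | Verifying that the minimum required actions are taken |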
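To make the four modes concrete, here is a simplified sketch that compares only the names of the tools called. It is not the AgentEvals implementation: it ignores message roles and tool arguments, and the helper names (`countNames`, `contains`, `matchesMode`) are made up for this example. Counting repeated calls separately is also a choice of this sketch.

```typescript
type Mode = "strict" | "unordered" | "subset" | "superset";

// Count how many times each tool name occurs.
function countNames(names: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const n of names) counts[n] = (counts[n] ?? 0) + 1;
  return counts;
}

// True when every name in `inner` occurs at least as often in `outer`.
function contains(outer: string[], inner: string[]): boolean {
  const have = countNames(outer);
  const need = countNames(inner);
  return Object.keys(need).every((n) => (have[n] ?? 0) >= (need[n] ?? 0));
}

// Compare the tool names an agent actually called against a reference list.
function matchesMode(mode: Mode, actual: string[], reference: string[]): boolean {
  switch (mode) {
    case "strict": // same calls, same order
      return (
        actual.length === reference.length &&
        actual.every((n, i) => n === reference[i])
      );
    case "unordered": // same calls, any order
      return contains(actual, reference) && contains(reference, actual);
    case "subset": // nothing beyond the reference
      return contains(reference, actual);
    case "superset": // at least the reference
      return contains(actual, reference);
  }
}

// The agent called the two tools in the opposite order from the reference.
const actualCalls = ["get_weather", "get_events"];
const referenceCalls = ["get_events", "get_weather"];

const strictOk = matchesMode("strict", actualCalls, referenceCalls); // false
const unorderedOk = matchesMode("unordered", actualCalls, referenceCalls); // true
```

A swapped call order fails strict but passes unordered, which is the distinction the first two rows of the table describe.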
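The strict mode ensures that trajectories contain identical messages in the same order with the same tool calls, though it allows for differences in message content. This is useful when you need to enforce a specific sequence of operations, such as requiring a policy lookup before authorizing an action.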
import { createAgent, tool, HumanMessage, AIMessage, ToolMessage } from "langchain";
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({
      city: z.string(),
    }),
  }
);

const agent = createAgent({
  model: "gpt-4o",
  tools: [getWeather]
});

const evaluator = createTrajectoryMatchEvaluator({  
  trajectoryMatchMode: "strict",  
});  

async function testWeatherToolCalledStrict() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in San Francisco?")]
  });

  const referenceTrajectory = [
    new HumanMessage("What's the weather in San Francisco?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "San Francisco" } }
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in San Francisco.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in San Francisco is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory
  });
  // {
  //     'key': 'trajectory_strict_match',
  //     'score': true,
  //     'comment': null,
  // }
  expect(evaluation.score).toBe(true);
}
The unordered mode allows the same tool calls in any order, which is helpful when you want to verify that specific information was retrieved but don’t care about the sequence. For example, an agent might need to check both weather and events for a city, but the order doesn’t matter.
import { createAgent, tool, HumanMessage, AIMessage, ToolMessage } from "langchain";
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const getEvents = tool(
  async ({ city }: { city: string }) => {
    return `Concert at the park in ${city} tonight.`;
  },
  {
    name: "get_events",
    description: "Get events happening in a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "gpt-4o",
  tools: [getWeather, getEvents]
});

const evaluator = createTrajectoryMatchEvaluator({  
  trajectoryMatchMode: "unordered",  
});  

async function testMultipleToolsAnyOrder() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's happening in SF today?")]
  });

  // Reference shows tools called in different order than actual execution
  const referenceTrajectory = [
    new HumanMessage("What's happening in SF today?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_events", args: { city: "SF" } },
        { id: "call_2", name: "get_weather", args: { city: "SF" } },
      ]
    }),
    new ToolMessage({
      content: "Concert at the park in SF tonight.",
      tool_call_id: "call_1"
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in SF.",
      tool_call_id: "call_2"
    }),
    new AIMessage("Today in SF: 75 degrees and sunny with a concert at the park tonight."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  // {
  //     'key': 'trajectory_unordered_match',
  //     'score': true,
  // }
  expect(evaluation.score).toBe(true);
}
The superset and subset modes match partial trajectories. The superset mode verifies that the agent called at least the tools in the reference trajectory, allowing additional tool calls. The subset mode ensures the agent did not call any tools beyond those in the reference.
import { createAgent } from "langchain";
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryMatchEvaluator } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const getDetailedForecast = tool(
  async ({ city }: { city: string }) => {
    return `Detailed forecast for ${city}: sunny all week.`;
  },
  {
    name: "get_detailed_forecast",
    description: "Get detailed weather forecast for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "gpt-4o",
  tools: [getWeather, getDetailedForecast]
});

const evaluator = createTrajectoryMatchEvaluator({  
  trajectoryMatchMode: "superset",  
});  

async function testAgentCallsRequiredToolsPlusExtra() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Boston?")]
  });

  // Reference only requires getWeather, but agent may call additional tools
  const referenceTrajectory = [
    new HumanMessage("What's the weather in Boston?"),
    new AIMessage({
      content: "",
      tool_calls: [
        { id: "call_1", name: "get_weather", args: { city: "Boston" } },
      ]
    }),
    new ToolMessage({
      content: "It's 75 degrees and sunny in Boston.",
      tool_call_id: "call_1"
    }),
    new AIMessage("The weather in Boston is 75 degrees and sunny."),
  ];

  const evaluation = await evaluator({
    outputs: result.messages,
    referenceOutputs: referenceTrajectory,
  });
  // {
  //     'key': 'trajectory_superset_match',
  //     'score': true,
  //     'comment': null,
  // }
  expect(evaluation.score).toBe(true);
}
You can also set the toolArgsMatchMode property and/or toolArgsMatchOverrides to customize how the evaluator considers equality between tool calls in the actual trajectory vs. the reference. By default, only tool calls with the same arguments to the same tool are considered equal. Visit the repository for more details.
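The idea of relaxing argument equality can be sketched without the library. The helper below is a hypothetical stand-in, not the AgentEvals implementation: the mode names ("exact", "ignore", "subset", "superset"), their direction, and the shape of the per-tool override map are assumptions made for illustration, so check the repository for the values that toolArgsMatchMode and toolArgsMatchOverrides actually accept.

```typescript
type Args = Record<string, unknown>;
// Mode names are assumptions of this sketch, not confirmed library values.
type ArgsMode = "exact" | "ignore" | "subset" | "superset";
type ArgsMatcher = ArgsMode | ((actual: Args, reference: Args) => boolean);

// True when every key in `inner` exists in `outer` with an equal value.
// Values are compared via JSON.stringify, which is key-order sensitive
// for nested objects; that is good enough for a sketch.
function hasAll(outer: Args, inner: Args): boolean {
  return Object.keys(inner).every(
    (k) => k in outer && JSON.stringify(outer[k]) === JSON.stringify(inner[k])
  );
}

function argsMatch(matcher: ArgsMatcher, actual: Args, reference: Args): boolean {
  if (typeof matcher === "function") return matcher(actual, reference);
  switch (matcher) {
    case "ignore": // arguments are not compared at all
      return true;
    case "exact": // same keys and same values
      return hasAll(actual, reference) && hasAll(reference, actual);
    case "subset": // actual args are a subset of the reference args
      return hasAll(reference, actual);
    case "superset": // actual args include all of the reference args
      return hasAll(actual, reference);
  }
}

// Hypothetical per-tool override: compare the city case-insensitively.
const argOverrides: Record<string, ArgsMatcher | undefined> = {
  get_weather: (actual, reference) =>
    String(actual["city"]).toLowerCase() === String(reference["city"]).toLowerCase(),
};

// Use the override for a tool if one exists, otherwise the default mode.
function toolCallsEqual(
  name: string,
  actual: Args,
  reference: Args,
  defaultMode: ArgsMode = "exact"
): boolean {
  const matcher = argOverrides[name] ?? defaultMode;
  return argsMatch(matcher, actual, reference);
}
```

With this sketch, `toolCallsEqual("get_weather", { city: "sf" }, { city: "SF" })` passes because of the override, while the same arguments compared with the plain "exact" mode do not.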

LLM-as-Judge evaluator

You can also use an LLM to evaluate the agent’s execution path with the createTrajectoryLLMAsJudge function. Unlike the trajectory match evaluators, it doesn’t require a reference trajectory, but one can be provided if available.
import { createAgent } from "langchain";
import { tool } from "@langchain/core/tools";
import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";
import * as z from "zod";

const getWeather = tool(
  async ({ city }: { city: string }) => {
    return `It's 75 degrees and sunny in ${city}.`;
  },
  {
    name: "get_weather",
    description: "Get weather information for a city.",
    schema: z.object({ city: z.string() }),
  }
);

const agent = createAgent({
  model: "gpt-4o",
  tools: [getWeather]
});

const evaluator = createTrajectoryLLMAsJudge({  
  model: "openai:o3-mini",  
  prompt: TRAJECTORY_ACCURACY_PROMPT,  
});  

async function testTrajectoryQuality() {
  const result = await agent.invoke({
    messages: [new HumanMessage("What's the weather in Seattle?")]
  });

  const evaluation = await evaluator({
    outputs: result.messages,
  });
  // {
  //     'key': 'trajectory_accuracy',
  //     'score': true,
  //     'comment': 'The provided agent trajectory is reasonable...'
  // }
  expect(evaluation.score).toBe(true);
}
If you have a reference trajectory, you can add an extra variable to your prompt and pass in the reference trajectory. Below, we use the prebuilt TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE prompt and configure the reference_outputs variable:
import { TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE } from "agentevals";

const evaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
});

const evaluation = await evaluator({
  outputs: result.messages,
  referenceOutputs: referenceTrajectory,
});
For more configurability over how the LLM evaluates the trajectory, visit the repository.
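For intuition about what a judge-style evaluator does, the sketch below serializes a trajectory into a prompt, hands it to a judge, and parses a pass/fail verdict. It is a toy under stated assumptions: `stubJudge` is a hard-coded stand-in for a real LLM call, the PASS/FAIL reply format is invented for this example, and none of these helpers are part of AgentEvals.

```typescript
// Hypothetical message shape for this sketch; not a LangChain type.
type Msg = {
  role: "user" | "assistant" | "tool";
  content: string;
  toolCalls?: { name: string; args: Record<string, unknown> }[];
};

// Render the trajectory as numbered lines a judge model could read.
function formatTrajectory(messages: Msg[]): string {
  return messages
    .map((m, i) => {
      const calls = (m.toolCalls ?? [])
        .map((c) => c.name + "(" + JSON.stringify(c.args) + ")")
        .join(", ");
      const suffix = calls.length > 0 ? " -> calls: " + calls : "";
      return String(i + 1) + ". [" + m.role + "] " + m.content + suffix;
    })
    .join("\n");
}

// Combine a rubric, the trajectory, and an optional reference trajectory.
function buildJudgePrompt(rubric: string, trajectory: Msg[], reference?: Msg[]): string {
  const parts = [rubric, "Trajectory:", formatTrajectory(trajectory)];
  if (reference !== undefined) {
    parts.push("Reference trajectory:", formatTrajectory(reference));
  }
  parts.push("Answer PASS or FAIL on the first line, then explain.");
  return parts.join("\n");
}

// Parse the invented reply format: verdict on line one, comment after it.
function parseVerdict(raw: string): { score: boolean; comment: string } {
  const lines = raw.trim().split("\n");
  const first = (lines[0] ?? "").trim().toUpperCase();
  return { score: first === "PASS", comment: lines.slice(1).join("\n").trim() };
}

// Stand-in for the judge LLM: passes if the weather tool appears at all.
function stubJudge(prompt: string): string {
  return prompt.indexOf("get_weather(") !== -1
    ? "PASS\nThe agent looked up the weather before answering."
    : "FAIL\nThe agent never called the weather tool.";
}

const sampleTrajectory: Msg[] = [
  { role: "user", content: "What's the weather in Seattle?" },
  { role: "assistant", content: "", toolCalls: [{ name: "get_weather", args: { city: "Seattle" } }] },
  { role: "tool", content: "It's 75 degrees and sunny in Seattle." },
  { role: "assistant", content: "It's 75 degrees and sunny in Seattle." },
];

const verdict = parseVerdict(
  stubJudge(buildJudgePrompt("Judge whether the tool use is reasonable.", sampleTrajectory))
);
```

A real judge replaces `stubJudge` with a model call, which is where the extra cost and the reduced determinism mentioned earlier come from.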

LangSmith integration

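To track experiments over time, you can log evaluator results to LangSmith, a platform for building production-grade LLM applications that includes tracing, evaluation, and experimentation tools.
First, set up LangSmith by setting the required environment variables: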
export LANGSMITH_API_KEY="your_langsmith_api_key"
export LANGSMITH_TRACING="true"
LangSmith offers two main approaches for running evaluations: Vitest/Jest integration and the evaluate function.
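import * as ls from "langsmith/vitest";
// import * as ls from "langsmith/jest";

import { HumanMessage, AIMessage, ToolMessage } from "@langchain/core/messages";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";

// `agent` is assumed to be an agent created with createAgent, as in the earlier examples.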

const trajectoryEvaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

ls.describe("trajectory accuracy", () => {
  ls.test("accurate trajectory", {
    inputs: {
      messages: [
        {
          role: "user",
          content: "What is the weather in SF?"
        }
      ]
    },
    referenceOutputs: {
      messages: [
        new HumanMessage("What is the weather in SF?"),
        new AIMessage({
          content: "",
          tool_calls: [
            { id: "call_1", name: "get_weather", args: { city: "SF" } }
          ]
        }),
        new ToolMessage({
          content: "It's 75 degrees and sunny in SF.",
          tool_call_id: "call_1"
        }),
        new AIMessage("The weather in SF is 75 degrees and sunny."),
      ],
    },
  }, async ({ inputs, referenceOutputs }) => {
    const result = await agent.invoke({
      messages: [new HumanMessage("What is the weather in SF?")]
    });

    ls.logOutputs({ messages: result.messages });

    await trajectoryEvaluator({
      inputs,
      outputs: result.messages,
      referenceOutputs,
    });
  });
});
Run the evaluation with your test runner:
vitest run test_trajectory.eval.ts
# or
jest test_trajectory.eval.ts
Alternatively, you can create a dataset in LangSmith and use the evaluate function:
import { evaluate } from "langsmith/evaluation";
import { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT } from "agentevals";

const trajectoryEvaluator = createTrajectoryLLMAsJudge({
  model: "openai:o3-mini",
  prompt: TRAJECTORY_ACCURACY_PROMPT,
});

async function runAgent(inputs: any) {
  const result = await agent.invoke(inputs);
  return result.messages;
}

await evaluate(
  runAgent,
  {
    data: "your_dataset_name",
    evaluators: [trajectoryEvaluator],
  }
);
Results will be automatically logged to LangSmith.
To learn more about evaluating your agent, see the LangSmith docs.
