Running an evaluation requires three main pieces:
  1. A dataset of test inputs and expected outputs
  2. A target function, which is what you are evaluating
  3. Evaluators that score the outputs of your target function
This guide shows you how to define a target function for the part of your application you are evaluating. For how to create datasets and how to define evaluators, see here; for an end-to-end example of running an evaluation, see here.

Target function signature

To evaluate an application in code, we need a way to run it. When using evaluate() (Python/TypeScript), we do this by passing in a target function argument. This is a function that takes a dataset Example's inputs and returns the application output as a dictionary. Within this function we can call our application however we like, and we can format the output however we like. The key is that any evaluator functions we define must work with the output format we return from the target function.
from langsmith import Client

# 'inputs' will come from your dataset.
def dummy_target(inputs: dict) -> dict:
    return {"foo": 1, "bar": "two"}

# 'inputs' will come from your dataset.
# 'outputs' will come from your target function.
def evaluator_one(inputs: dict, outputs: dict) -> bool:
    return outputs["foo"] == 2

def evaluator_two(inputs: dict, outputs: dict) -> bool:
    return len(outputs["bar"]) < 3

client = Client()
results = client.evaluate(
    dummy_target,  # <-- target function
    data="your-dataset-name",
    evaluators=[evaluator_one, evaluator_two],
    ...
)
evaluate() will automatically trace your target function. This means that if you run any traceable code inside the target function, it will also be traced as a child run of the target's trace.

Example: Single LLM call

from langsmith import wrappers
from openai import OpenAI

# Optionally wrap the OpenAI client to automatically
# trace all model calls.
oai_client = wrappers.wrap_openai(OpenAI())

def target(inputs: dict) -> dict:
    # This assumes your dataset has inputs with a 'messages' key.
    # You can update this to match your dataset schema.
    messages = inputs["messages"]
    response = oai_client.chat.completions.create(
        messages=messages,
        model="gpt-4o-mini",
    )
    return {"answer": response.choices[0].message.content}
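An evaluator for this target only needs to work with the {"answer": ...} dictionary it returns, not with the raw OpenAI response object. As a hypothetical example (the name and 100-word threshold are illustrative, not part of any API):

```python
# Hypothetical evaluator matched to the target above: it relies only on
# the 'answer' key that the target returns.
def concision_evaluator(inputs: dict, outputs: dict) -> bool:
    # Passes if the model's answer stays under 100 words.
    return len(outputs["answer"].split()) < 100
```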

Example: Non-LLM component

from langsmith import traceable

# Optionally decorate with '@traceable' to trace all invocations of this function.
@traceable
def calculator_tool(operation: str, number1: float, number2: float) -> str:
    if operation == "add":
        return str(number1 + number2)
    elif operation == "subtract":
        return str(number1 - number2)
    elif operation == "multiply":
        return str(number1 * number2)
    elif operation == "divide":
        return str(number1 / number2)
    else:
        raise ValueError(f"Unrecognized operation: {operation}.")

# This is the function you will evaluate.
def target(inputs: dict) -> dict:
    # This assumes your dataset has inputs with `operation`, `num1`, and `num2` keys.
    operation = inputs["operation"]
    number1 = inputs["num1"]
    number2 = inputs["num2"]
    result = calculator_tool(operation, number1, number2)
    return {"result": result}
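Since the calculator is deterministic, a natural evaluator is an exact match against the dataset's reference outputs. The sketch below assumes your dataset examples store the expected value under an "expected" key; adjust the key to match your dataset schema:

```python
# Hypothetical exact-match evaluator for the calculator target above.
# 'reference_outputs' comes from the dataset example's stored outputs;
# the 'expected' key is an assumption about your dataset schema.
def correctness_evaluator(
    inputs: dict, outputs: dict, reference_outputs: dict
) -> bool:
    return outputs["result"] == reference_outputs["expected"]
```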

Example: Application or agent

from my_agent import agent

# This is the function you will evaluate.
def target(inputs: dict) -> dict:
    # This assumes your dataset has inputs with a `messages` key.
    messages = inputs["messages"]
    # Replace `invoke` with whatever you use to call your agent.
    response = agent.invoke({"messages": messages})
    # This assumes your agent output is in the right format.
    return response
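If the agent returns the full conversation, an evaluator can pull out just the final message to score. The sketch below assumes the agent's output is a dict with a "messages" list whose last entry holds the final answer under a "content" key (the common chat-message shape); adjust it to your agent's actual output format:

```python
# Hypothetical evaluator for an agent target that returns
# {"messages": [...]}: it scores only the final message's content.
def final_answer_evaluator(inputs: dict, outputs: dict) -> bool:
    final = outputs["messages"][-1]["content"]
    # Passes if the agent produced a non-empty final answer.
    return len(final) > 0
```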
If you have a LangGraph/LangChain agent that accepts the inputs defined in your dataset and returns the output format your evaluators expect, you can pass that object in as the target directly:
from my_agent import agent
from langsmith import Client
client = Client()
client.evaluate(agent, ...)
