While it is often enough to evaluate the final output of your task, in some cases you may want to evaluate the intermediate steps of your pipeline. For example, for retrieval augmented generation (RAG), you might want to:
  1. Evaluate the retrieval step to ensure that the correct documents were retrieved w.r.t the input query.
  2. Evaluate the generation step to ensure that the correct answer was generated w.r.t the retrieved documents.
In this guide we will use a simple, fully custom evaluator for criterion 1 and an LLM-based evaluator for criterion 2 to highlight both scenarios. In order to evaluate the intermediate steps of the pipeline, your evaluator function should traverse and process the run / rootRun argument, which is a Run object that contains the intermediate steps of your pipeline.
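As a preview, an evaluator that inspects an intermediate step can be sketched roughly as follows. This is a minimal, hypothetical sketch (the helper names and the assumption that a traced step named "retrieve" exists are placeholders); the full working evaluators for this guide's pipeline are defined in step 3 below.
from typing import Optional
from langsmith.schemas import Run

def find_run(run: Run, name: str) -> Optional[Run]:
    """Depth-first search of the trace for a child run with the given name."""
    for child in run.child_runs or []:
        if child.name == name:
            return child
        found = find_run(child, name)
        if found:
            return found
    return None

def retrieval_not_empty(run: Run) -> bool:
    """Toy evaluator: check that the step named "retrieve" returned at least one document."""
    retrieve_run = find_run(run, "retrieve")
    return bool(retrieve_run and (retrieve_run.outputs or {}).get("output"))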

1. Define your LLM pipeline

The below RAG pipeline involves 1) generating a Wikipedia query given the input question, 2) retrieving relevant documents from Wikipedia, and 3) generating an answer given the retrieved documents.
pip install -U langsmith langchain[openai] wikipedia
Requires langsmith>=0.3.13
import wikipedia as wp
from openai import OpenAI
from langsmith import traceable, wrappers

oai_client = wrappers.wrap_openai(OpenAI())

@traceable
def generate_wiki_search(question: str) -> str:
    """Generate the query to search in wikipedia."""
    instructions = (
        "Generate a search query to pass into wikipedia to answer the user's question. "
        "Return only the search query and nothing more. "
        "This will passed in directly to the wikipedia search engine."
    )
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": question}
    ]
    result = oai_client.chat.completions.create(
        messages=messages,
        model="gpt-4o-mini",
        temperature=0,
    )
    return result.choices[0].message.content

@traceable(run_type="retriever")
def retrieve(query: str) -> list:
    """Get up to two search wikipedia results."""
    results = []
    for term in wp.search(query, results = 10):
        try:
            page = wp.page(term, auto_suggest=False)
            results.append({
                "page_content": page.summary,
                "type": "Document",
                "metadata": {"url": page.url}
            })
        except wp.DisambiguationError:
            pass
        if len(results) >= 2:
            break
    return results

@traceable
def generate_answer(question: str, context: str) -> str:
    """Answer the question based on the retrieved information."""
    instructions = f"Answer the user's question based ONLY on the content below:\n\n{context}"
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": question}
    ]
    result = oai_client.chat.completions.create(
        messages=messages,
        model="gpt-4o-mini",
        temperature=0
    )
    return result.choices[0].message.content

@traceable
def qa_pipeline(question: str) -> str:
    """The full pipeline."""
    query = generate_wiki_search(question)
    context = "\n\n".join([doc["page_content"] for doc in retrieve(query)])
    return generate_answer(question, context)
This pipeline will produce a trace that looks something like this:
[Image: evaluation_intermediate_trace.png]
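To generate such a trace and sanity-check the pipeline before evaluating it, you can call it directly (this assumes your OpenAI API key and LangSmith tracing environment variables are configured):
# Quick smoke test; produces a trace when tracing is enabled.
print(qa_pipeline("What is LangChain?"))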

2. Create a dataset and examples to evaluate the pipeline

We are building a very simple dataset with a couple of examples to evaluate the pipeline.
Requires langsmith>=0.3.13
from langsmith import Client

ls_client = Client()
dataset_name = "Wikipedia RAG"

if not ls_client.has_dataset(dataset_name=dataset_name):
    dataset = ls_client.create_dataset(dataset_name=dataset_name)
    examples = [
      {"inputs": {"question": "What is LangChain?"}},
      {"inputs": {"question": "What is LangSmith?"}},
    ]
    ls_client.create_examples(
      dataset_id=dataset.id,
      examples=examples,
    )
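Optionally, you can read the examples back to confirm the dataset was populated:
# Optional: list the examples in the dataset.
for example in ls_client.list_examples(dataset_name=dataset_name):
    print(example.inputs)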

3. Define your custom evaluators

As mentioned above, we will define two evaluators: one that checks the relevance of the retrieved documents w.r.t the input query, and another that checks whether the generated answer is grounded in (i.e., not hallucinated from) the retrieved documents. We will use LangChain LLM wrappers, along with with_structured_output, to define the hallucination evaluator. The key point is that the evaluator function should traverse the run / rootRun argument to access the intermediate steps of the pipeline. The evaluator can then process the inputs and outputs of those intermediate steps and score them against the desired criteria. The example uses LangChain for convenience; it is not required.
from langchain.chat_models import init_chat_model
from langsmith.schemas import Run
from pydantic import BaseModel, Field

def document_relevance(run: Run) -> bool:
    """Checks if retriever input exists in the retrieved docs."""
    qa_pipeline_run = next(
        r for run in run.child_runs if r.name == "qa_pipeline"
    )
    retrieve_run = next(
        r for run in qa_pipeline_run.child_runs if r.name == "retrieve"
    )
    page_contents = "\n\n".join(
        doc["page_content"] for doc in retrieve_run.outputs["output"]
    )
    return retrieve_run.inputs["query"] in page_contents

# Data model
class GradeHallucinations(BaseModel):
    """Binary score for hallucination present in generation answer."""
    is_grounded: bool = Field(..., description="True if the answer is grounded in the facts, False otherwise.")

# LLM with structured outputs for grading hallucinations
# For more see: https://python.langchain.com/docs/how_to/structured_output/
grader_llm = init_chat_model("gpt-4o-mini", temperature=0).with_structured_output(
    GradeHallucinations,
    method="json_schema",
    strict=True,
)

def no_hallucination(run: Run) -> bool:
    """Check if the answer is grounded in the documents.
    Return True if there is no hallucination, False otherwise.
    """
    # Get documents and answer
    qa_pipeline_run = next(
        r for r in run.child_runs if r.name == "qa_pipeline"
    )
    retrieve_run = next(
        r for r in qa_pipeline_run.child_runs if r.name == "retrieve"
    )
    retrieved_content = "\n\n".join(
        doc["page_content"] for doc in retrieve_run.outputs["output"]
    )

    # Construct prompt
    instructions = (
        "You are a grader assessing whether an LLM generation is grounded in / "
        "supported by a set of retrieved facts. Give a binary score 1 or 0, "
        "where 1 means that the answer is grounded in / supported by the set of facts."
    )
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": f"Set of facts:\n{retrieved_content}\n\nLLM generation: {run.outputs['answer']}"},
    ]
    grade = grader_llm.invoke(messages)
    return grade.is_grounded

4. Evaluate the pipeline

Finally, we’ll run evaluate with the custom evaluators defined above.
def qa_wrapper(inputs: dict) -> dict:
    """Wrap the qa_pipeline so it can accept the Example.inputs dict as input."""
    return {"answer": qa_pipeline(inputs["question"])}

experiment_results = ls_client.evaluate(
    qa_wrapper,
    data=dataset_name,
    evaluators=[document_relevance, no_hallucination],
    experiment_prefix="rag-wiki-oai"
)
The experiment will contain the results of the evaluation, including the scores and comments from the evaluators:
[Image: evaluation_intermediate_experiment.png]
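If you have pandas installed, you can also pull the experiment results into a dataframe for a quick local look (the exact column layout may vary across langsmith versions):
# Optional: inspect the evaluation results locally (requires pandas).
df = experiment_results.to_pandas()
print(df.head())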