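UpTrain [github || website || docs] is an open-source platform for evaluating and improving LLM applications. It provides grades for 20+ preconfigured checks (covering language, code, and embedding use cases), performs root cause analysis on failure cases, and gives guidance on how to resolve them.

UpTrain Callback Handler

This notebook shows how the UpTrain callback handler integrates into your pipeline to run several evaluations. We have chosen a few checks that are well suited to evaluating chains. These evaluations run automatically, and the results are displayed in the output. More details on UpTrain's evaluations can be found in the UpTrain docs.

Selected retrievers from LangChain are used for the demonstration: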

1. Vanilla RAG

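RAG plays a crucial role in retrieving context and generating responses. To ensure its performance and response quality, we run the following evaluations:
  • Context Relevance: Determines whether the context retrieved for the query is relevant.
  • Factual Accuracy: Assesses whether the LLM is hallucinating or providing incorrect information.
  • Response Completeness: Checks whether the response contains all the information requested by the query.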

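2. Multi Query Generation

MultiQueryRetriever creates multiple variants of a question that have the same meaning as the original question. Given the added complexity, we include the previous evaluations and add:
  • Multi Query Accuracy: Checks that the generated queries mean the same as the original query.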

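3. Context Compression and Reranking

Reranking reorders nodes based on their relevance to the query and keeps the top n nodes. Since the number of nodes can decrease once reranking is complete, we run the following evaluations:
  • Context Reranking: Checks whether the order of the re-ranked nodes is more relevant to the query than the original order.
  • Context Conciseness: Checks whether the reduced number of nodes still provides all the required information.

Together, these evaluations help ensure the robustness and effectiveness of the RAG, MultiQueryRetriever, and reranking steps in the chain.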

Install Dependencies

pip install -qU langchain langchain-classic langchain_openai langchain-community uptrain faiss-cpu flashrank
Note: you can install faiss-gpu instead of faiss-cpu if you want to use the GPU-enabled version of the library.

Import Libraries

from getpass import getpass

from langchain_classic.chains import RetrievalQA
from langchain_classic.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_classic.retrievers.document_compressors import FlashrankRerank
from langchain_classic.retrievers.multi_query import MultiQueryRetriever
from langchain_community.callbacks.uptrain_callback import UpTrainCallbackHandler
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers.string import StrOutputParser
from langchain_core.prompts.chat import ChatPromptTemplate
from langchain_core.runnables.passthrough import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
)

Load the documents

loader = TextLoader("../../how_to/state_of_the_union.txt")
documents = loader.load()

Split the document into chunks

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
chunks = text_splitter.split_documents(documents)
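
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraphs and sentences). The function name `naive_chunks` is hypothetical and only illustrates what chunk_size and chunk_overlap control.

```python
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Hypothetical helper for illustration only; consecutive windows share
    chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(naive_chunks(sample, chunk_size=4))                   # no overlap
print(naive_chunks(sample, chunk_size=4, chunk_overlap=2))  # windows share 2 characters
```

With chunk_overlap=0, as used in this notebook, every character belongs to exactly one chunk.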

Create the retriever

embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(chunks, embeddings)
retriever = db.as_retriever()
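
Conceptually, the retriever embeds the query and returns the chunks whose vectors are closest to it. The toy sketch below mimics that idea with hand-written 2-D vectors and cosine similarity (chosen here for simplicity). The vectors, the sample texts, and the `top_k` helper are all made up for illustration and do not reflect how FAISS or OpenAIEmbeddings are implemented.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Made-up (text, vector) pairs standing in for embedded chunks.
index = [
    ("chunk about the Supreme Court nomination", [0.9, 0.1]),
    ("chunk about the economy", [0.1, 0.9]),
    ("chunk about judges", [0.8, 0.3]),
]
print(top_k([1.0, 0.0], index, k=2))
```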

Define the LLM

llm = ChatOpenAI(temperature=0, model="gpt-4")

Setup

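UpTrain provides you with:
  1. Dashboards with advanced drill-down and filtering options
  2. Insights and common topics among failing cases
  3. Observability and real-time monitoring of production data
  4. Regression testing via seamless integration with your CI/CD pipelines
You can choose between the following options for evaluating with UpTrain:

1. UpTrain's Open-Source Software (OSS)

You can use the open-source evaluation service to evaluate your model. In this case, you will need to provide an OpenAI API key, because UpTrain uses GPT models to evaluate the responses generated by the LLM. You can obtain a key from OpenAI.

To view your evaluations in the UpTrain dashboard, set it up by running the following commands in your terminal: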
git clone https://github.com/uptrain-ai/uptrain
cd uptrain
bash run_uptrain.sh
This starts the UpTrain dashboard on your local machine, accessible at http://localhost:3000/dashboard.

Parameters:
  • key_type="openai"
  • api_key="OPENAI_API_KEY"
  • project_name="PROJECT_NAME"

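2. UpTrain Managed Service and Dashboards

Alternatively, you can use UpTrain's managed service to evaluate your model. You can create a free UpTrain account and get trial credits. If you need more trial credits, book a call with the maintainers of UpTrain.

The benefits of using the managed service are:
  1. No need to set up the UpTrain dashboard on your local machine.
  2. Access to many LLMs without needing their API keys.
Once you have run the evaluations, you can view them at https://dashboard.uptrain.ai/dashboard.

Parameters:
  • key_type="uptrain"
  • api_key="UPTRAIN_API_KEY"
  • project_name="PROJECT_NAME"
Note: project_name is the name of the project under which the evaluations appear in the UpTrain dashboard.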
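
The two options differ only in the arguments passed to the callback handler. As a rough sketch, the hypothetical helper below (`uptrain_handler_kwargs` is not part of UpTrain or LangChain) assembles the three parameters listed above and rejects an unknown key_type. The resulting dictionary could then be unpacked into UpTrainCallbackHandler(**kwargs).

```python
VALID_KEY_TYPES = {"openai", "uptrain"}  # the two options described above


def uptrain_handler_kwargs(key_type: str, api_key: str, project_name: str) -> dict:
    """Build keyword arguments for the callback handler (hypothetical helper)."""
    if key_type not in VALID_KEY_TYPES:
        raise ValueError(f"key_type must be one of {sorted(VALID_KEY_TYPES)}")
    if not api_key:
        raise ValueError("api_key must not be empty")
    return {"key_type": key_type, "api_key": api_key, "project_name": project_name}


# Placeholder values; substitute a real key and your own project name.
kwargs = uptrain_handler_kwargs("openai", "OPENAI_API_KEY", "PROJECT_NAME")
print(kwargs)
```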

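Set the API key

The notebook will prompt you to enter the API key. You can choose between the OpenAI API key and the UpTrain API key by changing the KEY_TYPE value (passed as the key_type parameter) in the cell below.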
KEY_TYPE = "openai"  # or "uptrain"
API_KEY = getpass()
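
When the notebook runs non-interactively, prompting with getpass is inconvenient. A possible variation, sketched below, reads the key from an environment variable first and only prompts when it is missing. The variable names OPENAI_API_KEY and UPTRAIN_API_KEY and the `resolve_api_key` helper are assumptions made for this example; nothing in the handler requires them.

```python
import os
from getpass import getpass

# Assumed environment variable names, one per key type.
ENV_VAR_FOR_KEY_TYPE = {"openai": "OPENAI_API_KEY", "uptrain": "UPTRAIN_API_KEY"}


def resolve_api_key(key_type: str) -> str:
    """Return the key from the environment, prompting only if it is unset."""
    env_var = ENV_VAR_FOR_KEY_TYPE[key_type]
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass(f"Enter your {key_type} API key: ")
```

With this helper in place, the cell above could become API_KEY = resolve_api_key(KEY_TYPE).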

1. Vanilla RAG

The UpTrain callback handler automatically captures the query, context, and response once they are generated, and runs the following three evaluations (each graded from 0 to 1) on the response:
  • Context Relevance
  • Factual Accuracy
  • Response Completeness
# Create the RAG prompt
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
rag_prompt_text = ChatPromptTemplate.from_template(template)

# Create the chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt_text
    | llm
    | StrOutputParser()
)

# Create the uptrain callback handler
uptrain_callback = UpTrainCallbackHandler(key_type=KEY_TYPE, api_key=API_KEY)
config = {"callbacks": [uptrain_callback]}

# Invoke the chain with a query
query = "What did the president say about Ketanji Brown Jackson"
response = chain.invoke(query, config=config)  # StrOutputParser returns the answer as a string
2024-04-17 17:03:44.969 | INFO     | uptrain.framework.evalllm:evaluate_on_server:378 - Sending evaluation request for rows 0 to <50 to the Uptrain
2024-04-17 17:04:05.809 | INFO     | uptrain.framework.evalllm:evaluate:367 - Local server not running, start the server to log data and visualize in the dashboard!
Question: What did the president say about Ketanji Brown Jackson
Response: The president mentioned that he had nominated Ketanji Brown Jackson to serve on the United States Supreme Court 4 days ago. He described her as one of the nation's top legal minds who will continue Justice Breyer’s legacy of excellence. He also mentioned that she is a former top litigator in private practice, a former federal public defender, and comes from a family of public school educators and police officers. He described her as a consensus builder and noted that since her nomination, she has received a broad range of support from various groups, including the Fraternal Order of Police and former judges appointed by both Democrats and Republicans.

Context Relevance Score: 1.0
Factual Accuracy Score: 1.0
Response Completeness Score: 1.0

2. Multi Query Generation

The MultiQueryRetriever addresses the problem that a RAG pipeline might not return the best set of documents for a given query. It generates multiple queries that mean the same as the original query and then fetches documents for each of them. To evaluate this retriever, UpTrain runs the following evaluation in addition to the previous three:
  • Multi Query Accuracy: Checks that the generated queries mean the same as the original query.
# Create the retriever
multi_query_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)

# Create the uptrain callback
uptrain_callback = UpTrainCallbackHandler(key_type=KEY_TYPE, api_key=API_KEY)
config = {"callbacks": [uptrain_callback]}

# Create the RAG prompt
template = """Answer the question based only on the following context, which can include text and tables:
{context}
Question: {question}
"""
rag_prompt_text = ChatPromptTemplate.from_template(template)

chain = (
    {"context": multi_query_retriever, "question": RunnablePassthrough()}
    | rag_prompt_text
    | llm
    | StrOutputParser()
)

# Invoke the chain with a query
question = "What did the president say about Ketanji Brown Jackson"
response = chain.invoke(question, config=config)  # StrOutputParser returns the answer as a string
2024-04-17 17:04:10.675 | INFO     | uptrain.framework.evalllm:evaluate_on_server:378 - Sending evaluation request for rows 0 to <50 to the Uptrain
2024-04-17 17:04:16.804 | INFO     | uptrain.framework.evalllm:evaluate:367 - Local server not running, start the server to log data and visualize in the dashboard!
Question: What did the president say about Ketanji Brown Jackson
Multi Queries:
  - How did the president comment on Ketanji Brown Jackson?
  - What were the president's remarks regarding Ketanji Brown Jackson?
  - What statements has the president made about Ketanji Brown Jackson?

Multi Query Accuracy Score: 0.5
2024-04-17 17:04:22.027 | INFO     | uptrain.framework.evalllm:evaluate_on_server:378 - Sending evaluation request for rows 0 to <50 to the Uptrain
2024-04-17 17:04:44.033 | INFO     | uptrain.framework.evalllm:evaluate:367 - Local server not running, start the server to log data and visualize in the dashboard!
Question: What did the president say about Ketanji Brown Jackson
Response: The president mentioned that he had nominated Circuit Court of Appeals Judge Ketanji Brown Jackson to serve on the United States Supreme Court 4 days ago. He described her as one of the nation's top legal minds who will continue Justice Breyer’s legacy of excellence. He also mentioned that since her nomination, she has received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

Context Relevance Score: 1.0
Factual Accuracy Score: 1.0
Response Completeness Score: 1.0
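
To see why several phrasings can help, consider how the results are combined: documents are fetched for each generated query and duplicates are dropped. The sketch below imitates that merge step with a fake retriever backed by a dictionary. `merge_unique`, `fake_retrieve`, and the document IDs are invented for the example, and this is not MultiQueryRetriever's actual code.

```python
def merge_unique(queries, retrieve):
    """Fetch documents for every query and keep each one once, in first-seen order."""
    seen = set()
    merged = []
    for q in queries:
        for doc in retrieve(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Invented lookup table standing in for a real retriever.
FAKE_RESULTS = {
    "How did the president comment on Ketanji Brown Jackson?": ["doc_12", "doc_7"],
    "What were the president's remarks regarding Ketanji Brown Jackson?": ["doc_7", "doc_31"],
}


def fake_retrieve(query):
    """Return the canned results for a query, or nothing if it is unknown."""
    return FAKE_RESULTS.get(query, [])


print(merge_unique(list(FAKE_RESULTS), fake_retrieve))
```

If a generated query drifts away from the original meaning, the documents it contributes to the merged set can be off-topic, which is why the Multi Query Accuracy score matters.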

3. Context Compression and Reranking

The reranking process reorders nodes based on their relevance to the query and keeps the top n nodes. Since the number of nodes can decrease once reranking is complete, we perform the following evaluations:
  • Context Reranking: Check if the order of re-ranked nodes is more relevant to the query than the original order.
  • Context Conciseness: Check if the reduced number of nodes still provides all the required information.
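
As a mental model for what the compressor does, the following sketch reorders a handful of nodes by a relevance score and keeps the top n. The scores are hard-coded and the `rerank_top_n` helper is hypothetical; FlashrankRerank computes relevance with a ranking model rather than taking scores as input.

```python
def rerank_top_n(nodes, scores, n):
    """Sort nodes by descending score and keep the first n."""
    if len(nodes) != len(scores):
        raise ValueError("nodes and scores must have the same length")
    ranked = sorted(zip(nodes, scores), key=lambda pair: pair[1], reverse=True)
    return [node for node, _ in ranked[:n]]


retrieved = ["node_a", "node_b", "node_c", "node_d"]  # original retrieval order
relevance = [0.20, 0.95, 0.10, 0.60]                  # made-up relevance scores
print(rerank_top_n(retrieved, relevance, n=2))
```

Because only n nodes survive, anything that was present only in the dropped nodes is no longer available to the LLM, which is the situation the Context Conciseness check is meant to catch.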
# Create the retriever
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

# Create the chain
chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever)

# Create the uptrain callback
uptrain_callback = UpTrainCallbackHandler(key_type=KEY_TYPE, api_key=API_KEY)
config = {"callbacks": [uptrain_callback]}

# Invoke the chain with a query
query = "What did the president say about Ketanji Brown Jackson"
result = chain.invoke(query, config=config)
2024-04-17 17:04:46.462 | INFO     | uptrain.framework.evalllm:evaluate_on_server:378 - Sending evaluation request for rows 0 to <50 to the Uptrain
2024-04-17 17:04:53.561 | INFO     | uptrain.framework.evalllm:evaluate:367 - Local server not running, start the server to log data and visualize in the dashboard!
Question: What did the president say about Ketanji Brown Jackson

Context Conciseness Score: 0.0
Context Reranking Score: 1.0
2024-04-17 17:04:56.947 | INFO     | uptrain.framework.evalllm:evaluate_on_server:378 - Sending evaluation request for rows 0 to <50 to the Uptrain
2024-04-17 17:05:16.551 | INFO     | uptrain.framework.evalllm:evaluate:367 - Local server not running, start the server to log data and visualize in the dashboard!
Question: What did the president say about Ketanji Brown Jackson
Response: The President mentioned that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson to serve on the United States Supreme Court 4 days ago. He described her as one of the nation's top legal minds who will continue Justice Breyer’s legacy of excellence.

Context Relevance Score: 1.0
Factual Accuracy Score: 1.0
Response Completeness Score: 0.5
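
The scores are printed as plain text. If you capture that output (for example by redirecting stdout), a small parser can turn the score lines into a dictionary and flag anything below a threshold. This is only a sketch: the `parse_scores` helper and the 0.7 threshold are arbitrary choices for illustration, and the line format is assumed to match the output shown above.

```python
import re

# Matches lines such as "Context Relevance Score: 1.0".
SCORE_LINE = re.compile(r"^(?P<name>.+?) Score: (?P<value>[0-9]*\.?[0-9]+)\s*$")


def parse_scores(output: str) -> dict[str, float]:
    """Collect '<Name> Score: <number>' lines from captured text."""
    scores = {}
    for line in output.splitlines():
        match = SCORE_LINE.match(line.strip())
        if match:
            scores[match.group("name")] = float(match.group("value"))
    return scores


captured = """Context Relevance Score: 1.0
Factual Accuracy Score: 1.0
Response Completeness Score: 0.5
"""
scores = parse_scores(captured)
low = [name for name, value in scores.items() if value < 0.7]  # arbitrary threshold
print(scores)
print("Needs attention:", low)
```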

UpTrain’s Dashboard and Insights

Here’s a short video showcasing the dashboard and the insights: langchain_uptrain.gif