Overview
This tutorial will familiarize you with LangChain's document loader, embedding, and vector store abstractions. These abstractions are designed to support retrieval of data from (vector) databases and other sources for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in the case of retrieval-augmented generation (RAG). Here, we will build a search engine over a PDF document, which will let us retrieve passages in the PDF that are similar to an input query. The guide also includes a minimal RAG implementation on top of the search engine.

Concepts
This guide focuses on retrieval of text data. We will cover the following concepts: documents and document loaders, text splitters, embeddings, and vector stores and retrievers.

Setup
Installation
This tutorial requires the langchain-community and pypdf packages:
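For example, with pip:

```bash
pip install langchain-community pypdf
```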
LangSmith
Many of the applications you build with LangChain will contain multiple steps with multiple invocations of LLM calls. As these applications get more and more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith. After you sign up at the link above, make sure to set your environment variables to start logging traces:
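A sketch of the typical setup, via environment variables (the variable names follow current LangSmith docs; substitute your own API key):

```bash
export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."
```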
1. Documents and document loaders

LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:

- page_content: a string representing the content;
- metadata: a dict containing arbitrary metadata;
- id: (optional) a string identifier for the document.
The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.
We can generate sample documents when desired:
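For instance, a couple of hand-written Document objects (the content here is purely illustrative):

```python
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
```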
Loading documents
Let's load a PDF into a sequence of Document objects. Our example is a 10-K filing for Nike from 2023. We can consult the LangChain documentation for the available PDF document loaders.
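A minimal loading sketch; the file path below is an assumption about where the PDF was saved:

```python
from langchain_community.document_loaders import PyPDFLoader

file_path = "./example_data/nke-10k-2023.pdf"  # assumed local path to the 10-K PDF
loader = PyPDFLoader(file_path)
docs = loader.load()

print(len(docs))  # one Document per page
```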
PyPDFLoader loads one Document object per PDF page. For each, we can easily access:
- The string content of the page;
- Metadata containing the file name and page number.
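For example:

```python
# Peek at the first page's content and metadata.
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)
```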
Splitting
For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not "washed out" by surrounding text.
We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.
We set add_start_index=True so that the character index where each split Document starts within the initial Document is preserved as metadata attribute "start_index".
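Putting those settings together:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # maximum characters per chunk
    chunk_overlap=200,     # characters shared between adjacent chunks
    add_start_index=True,  # record each chunk's offset within its source Document
)
all_splits = text_splitter.split_documents(docs)

print(len(all_splits))
```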
2. Embeddings
Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text. LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. Let's select a model (a sketch using OpenAI follows the list below):

- OpenAI
- Azure
- Google Gemini
- Google Vertex
- AWS
- HuggingFace
- Ollama
- Cohere
- MistralAI
- Nomic
- NVIDIA
- Voyage AI
- IBM watsonx
- Fake
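As one option from the list, a sketch using OpenAI (this assumes the langchain-openai package is installed and OPENAI_API_KEY is set; the model name is one current choice, not the only one):

```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

# Embeddings of different texts share the same dimensionality.
assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}")
```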
3. Vector stores
LangChain VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors.
LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third party; others can run in-memory for lightweight workloads. Let's select a vector store (an in-memory example follows the list below):
- In-memory
- AstraDB
- Chroma
- FAISS
- Milvus
- MongoDB
- PGVector
- PGVectorStore
- Pinecone
- Qdrant
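A sketch using the in-memory option, wrapping the embedding model selected above:

```python
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

# Index the chunks produced by the splitter.
ids = vector_store.add_documents(documents=all_splits)
```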
Having instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:
- Synchronously and asynchronously;
- By string query and by vector;
- With and without returning similarity scores;
- By similarity and by maximum marginal relevance (via max_marginal_relevance_search, to balance similarity to the query with diversity in the retrieved results).
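For example, a synchronous string query (the questions are illustrative, aimed at the Nike 10-K):

```python
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)
print(results[0])

# The same kind of query, returning similarity scores as well.
results_with_scores = vector_store.similarity_search_with_score(
    "What was Nike's revenue in 2023?"
)
doc, score = results_with_scores[0]
print(f"Score: {score}")
```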
4. Retrievers
LangChain VectorStore objects do not subclass Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations). Although we can construct retrievers from vector stores, retrievers can interface with non-vector-store sources of data as well (such as external APIs).
We can create a simple version of this ourselves, without subclassing Retriever. If we choose which method we wish to use to retrieve documents, we can create a runnable easily. Below we will build one around the similarity_search method:
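A minimal sketch using the chain decorator from langchain_core to turn similarity_search into a Runnable:

```python
from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> list[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
```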
Vector stores implement an as_retriever method that will generate a Retriever, specifically a VectorStoreRetriever. These retrievers include specific search_type and search_kwargs attributes that identify which methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following:
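For example:

```python
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

# Same queries as before; the results should match the hand-rolled runnable.
retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
```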
VectorStoreRetriever supports search types of "similarity" (default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". We can use the latter to threshold documents output by the retriever by similarity score.
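A hedged sketch of the threshold variant (the cutoff value is arbitrary, and not every vector store implementation exposes relevance scores):

```python
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5},  # arbitrary cutoff for illustration
)
```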
Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, check out the RAG tutorial.
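As promised in the overview, here is one minimal RAG sketch on top of the retriever; the chat model and prompt wording are assumptions, not a canonical chain:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # assumed model choice

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Answer using only the provided context.\n\nContext:\n{context}"),
        ("human", "{question}"),
    ]
)


def answer(question: str) -> str:
    # Retrieve relevant chunks, stuff them into the prompt, and call the model.
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    return (prompt | llm).invoke({"context": context, "question": question}).content


print(answer("How many distribution centers does Nike have in the US?"))
```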