概述

本教程将使您熟悉 LangChain 的文档加载器嵌入向量存储抽象。这些抽象旨在支持从(向量)数据库和其他来源检索数据,以便与 LLM 工作流程集成。它们对于获取数据作为模型推理的一部分进行推理的应用程序很重要,如检索增强生成或 RAG 的情况。 在这里,我们将在 PDF 文档上构建搜索引擎。这将允许我们检索 PDF 中与输入查询相似的段落。该指南还包括在搜索引擎之上的最小 RAG 实现。

概念

本指南侧重于检索文本数据。我们将涵盖以下概念:

设置

安装

本教程需要 langchain-communitypypdf 包:
pip install langchain-community pypdf
有关更多详细信息,请参阅我们的安装指南

LangSmith

您使用 LangChain 构建的许多应用程序将包含多个步骤和多次 LLM 调用。 随着这些应用程序变得越来越复杂,能够检查链或智能体内部究竟发生了什么变得至关重要。 执行此操作的最佳方法是使用 LangSmith 在上面的链接注册后,请确保设置您的环境变量以开始记录跟踪:
export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."
或者,如果在笔记本中,您可以使用以下方式设置:
import getpass
import os

os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = getpass.getpass()

1. 文档和文档加载器

LangChain 实现了一个 Document 抽象,旨在表示文本单元和相关元数据。它有三个属性:
  • page_content:表示内容的字符串;
  • metadata:包含任意元数据的字典;
  • id:(可选)文档的字符串标识符。
metadata 属性可以捕获有关文档来源、与其他文档的关系以及其他信息。请注意,单个 Document 对象通常表示较大文档的一个块。 我们可以在需要时生成示例文档:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]
但是,LangChain 生态系统实现了文档加载器,这些加载器与数百个常见源集成。这使得将这些源的数据合并到您的 AI 应用程序中变得容易。

加载文档

让我们将 PDF 加载到一系列 Document 对象中。这是一个示例 PDF — 2023 年 Nike 的 10-k 文件。我们可以查阅 LangChain 文档以了解可用的 PDF 文档加载器
from langchain_community.document_loaders import PyPDFLoader

file_path = "../example_data/nke-10k-2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))
107
PyPDFLoader loads one Document object per PDF page. For each, we can easily access:
  • The string content of the page;
  • Metadata containing the file name and page number.
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FO

{'source': '../example_data/nke-10k-2023.pdf', 'page': 0}

分割

对于信息检索和下游问答目的,页面可能过于粗糙。我们的最终目标是检索回答输入查询的 Document 对象,进一步分割我们的 PDF 将有助于确保文档相关部分的含义不会被周围的文本”冲淡”。 我们可以为此目的使用文本分割器。这里我们将使用一个基于字符进行分区的简单文本分割器。我们将文档分割成 1000 个字符的块,块之间有 200 个字符的重叠。重叠有助于减少将语句与其相关的重要上下文分离的可能性。我们使用 RecursiveCharacterTextSplitter,它将使用常见分隔符(如换行符)递归分割文档,直到每个块达到适当的大小。这是通用文本用例推荐的文本分割器。 我们设置 add_start_index=True,以便将每个分割的 Document 在初始 Document 中开始的字符索引保存为元数据属性 “start_index”。
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

print(len(all_splits))
514

2. Embeddings

Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text. LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. Let’s select a model:
  • OpenAI
  • Azure
  • Google Gemini
  • Google Vertex
  • AWS
  • HuggingFace
  • Ollama
  • Cohere
  • MistralAI
  • Nomic
  • NVIDIA
  • Voyage AI
  • IBM watsonx
  • Fake
pip install -U "langchain-openai"
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])
Generated vectors of length 1536

[-0.008586574345827103, -0.03341241180896759, -0.008936782367527485, -0.0036674530711025, 0.010564599186182022, 0.009598285891115665, -0.028587326407432556, -0.015824200585484505, 0.0030416189692914486, -0.012899317778646946]
Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

3. Vector stores

LangChain VectorStore objects contain methods for adding text and Document objects to the store, and querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors. LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third-party; others can run in-memory for lightweight workloads. Let’s select a vector store:
  • In-memory
  • AstraDB
  • Chroma
  • FAISS
  • Milvus
  • MongoDB
  • PGVector
  • PGVectorStore
  • Pinecone
  • Qdrant
pip install -U "langchain-core"
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)
Having instantiated our vector store, we can now index the documents.
ids = vector_store.add_documents(documents=all_splits)
Note that most vector store implementations will allow you to connect to an existing vector store— e.g., by providing a client, index name, or other information. See the documentation for a specific integration for more detail. Once we’ve instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:
  • Synchronously and asynchronously;
  • By string query and by vector;
  • With and without returning similarity scores;
  • By similarity and @[maximum marginal relevance][VectorStore.max_marginal_relevance_search] (to balance similarity with query to diversity in retrieved results).
The methods will generally include a list of Document objects in their outputs. Usage Embeddings typically represent text as a “dense” vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific key-terms used in the document. Return documents based on similarity to a string query:
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])
page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213
NIKE Brand in-line stores (including employee-only stores) 74
Converse stores (including factory stores) 82
TOTAL 369
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}
Async query:
results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])
page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}
Return scores:
# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)
Score: 0.23699893057346344

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS
The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metadata={'page': 35, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}
Return documents based on similarity to an embedded query:
embedding = embeddings.embed_query("How were Nike's margins impacted in 2023?")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])
page_content='Table of Contents
GROSS MARGIN
FISCAL 2023 COMPARED TO FISCAL 2022
For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to
43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:
*Wholesale equivalent
The decrease in gross margin for fiscal 2023 was primarily due to:
•Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as
product mix;
•Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in
the prior period resulting from lower available inventory supply;
•Unfavorable changes in net foreign currency exchange rates, including hedges; and
•Lower off-price margin, on a wholesale equivalent basis.
This was partially offset by:' metadata={'page': 36, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}
Learn more:

4. Retrievers

LangChain VectorStore objects do not subclass @[Runnable]. LangChain @[Retrievers] are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations). Although we can construct retrievers from vector stores, retrievers can interface with non-vector store sources of data, as well (such as external APIs). We can create a simple version of this ourselves, without subclassing Retriever. If we choose what method we wish to use to retrieve documents, we can create a runnable easily. Below we will build one around the similarity_search method:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
 [Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\nOur principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\nthe largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\nand sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales')]]
Vectorstores implement an as_retriever method that will generate a Retriever, specifically a VectorStoreRetriever. These retrievers include specific search_type and search_kwargs attributes that identify what methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)
[[Document(metadata={'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')],
 [Document(metadata={'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}, page_content='Table of Contents\nPART I\nITEM 1. BUSINESS\nGENERAL\nNIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\nOur principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\nthe largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\nand sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales')]]
VectorStoreRetriever supports search types of "similarity" (default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". We can use the latter to threshold documents output by the retriever by similarity score. Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for a LLM. To learn more about building such an application, check out the RAG tutorial tutorial.

后续步骤

您现在已经了解了如何在 PDF 文档上构建语义搜索引擎。 有关文档加载器的更多信息: 有关嵌入的更多信息: 有关向量存储的更多信息: 有关 RAG 的更多信息,请参阅:
Connect these docs programmatically to Claude, VSCode, and more via MCP for real-time answers.