Overview

This tutorial will familiarize you with LangChain's document loader, embedding, and vector store abstractions. These abstractions are designed to support retrieval of data from (vector) databases and other sources for integration with LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, as in the case of retrieval-augmented generation, or RAG. Here, we will build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query. The guide also includes a minimal RAG implementation on top of the search engine.

Concepts

This guide focuses on retrieval of text data. We will cover the following concepts:
  • Documents and document loaders;
  • Text splitters;
  • Embeddings;
  • Vector stores and retrievers.

Setup

Installation

This guide requires the @langchain/community and pdf-parse packages:
npm i @langchain/community pdf-parse
For more details, see our installation guide.

LangSmith

Many of the applications you build with LangChain will contain multiple steps and multiple LLM calls. As these applications get more and more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with LangSmith. After you sign up at the link above, make sure to set your environment variables to start logging traces:
export LANGSMITH_TRACING="true"
export LANGSMITH_API_KEY="..."

1. Documents and document loaders

LangChain implements a Document abstraction, which is intended to represent a unit of text and associated metadata. It has three attributes:
  • pageContent: a string representing the content;
  • metadata: a record containing arbitrary metadata;
  • id: (optional) a string identifier for the document.
The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document. We can generate sample documents when desired:
import { Document } from "@langchain/core/documents";

const documents = [
  new Document({
    pageContent:
      "Dogs are great companions, known for their loyalty and friendliness.",
    metadata: { source: "mammal-pets-doc" },
  }),
  new Document({
    pageContent: "Cats are independent pets that often enjoy their own space.",
    metadata: { source: "mammal-pets-doc" },
  }),
];
However, the LangChain ecosystem implements document loaders that integrate with hundreds of common sources. This makes it easy to incorporate data from these sources into your AI application.

Loading documents

Let's load a PDF into a sequence of Document objects. Here we use a sample PDF: Nike's 10-K filing from 2023. We can consult the LangChain documentation for the available PDF document loaders:
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const loader = new PDFLoader("../../data/nke-10k-2023.pdf");

const docs = await loader.load();
console.log(docs.length);
107
PDFLoader loads one Document object per PDF page. For each, we can easily access:
  • The string content of the page;
  • Metadata containing the file name and page number.
console.log(docs[0].pageContent.slice(0, 200));
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FO
console.log(docs[0].metadata);
{
  source: '../../data/nke-10k-2023.pdf',
  pdf: {
    version: '1.10.100',
    info: {
      PDFFormatVersion: '1.4',
      IsAcroFormPresent: false,
      IsXFAPresent: false,
      Title: '0000320187-23-000039',
      Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
      Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
      Keywords: '0000320187-23-000039; ; 10-K',
      Creator: 'EDGAR Filing HTML Converter',
      Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
      CreationDate: "D:20230720162200-04'00'",
      ModDate: "D:20230720162208-04'00'"
    },
    metadata: null,
    totalPages: 107
  },
  loc: { pageNumber: 1 }
}

Splitting

For both information retrieval and downstream question-answering purposes, a page may be too coarse a representation. Our goal in the end will be to retrieve Document objects that answer an input query, and further splitting our PDF will help ensure that the meanings of relevant portions of the document are not "washed out" by surrounding text. We can use text splitters for this purpose. Here we will use a simple text splitter that partitions based on characters. We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size. This is the recommended text splitter for generic text use cases.
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const allSplits = await textSplitter.splitDocuments(docs);

console.log(allSplits.length);
514
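
To see the overlap in practice, we can compare the tail of one chunk with the head of the next. This is an illustrative check, not part of the original walkthrough; the exact output depends on where the splitter finds separators:

// Consecutive chunks from the same page share up to 200 characters,
// so part of this text should appear in both printed strings.
console.log(allSplits[0].pageContent.slice(-200));
console.log(allSplits[1].pageContent.slice(0, 200));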

2. Embeddings

Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics (such as cosine similarity) to identify related text. LangChain supports embeddings from dozens of providers. These models specify how text should be converted into a numeric vector. Let’s select a model:
  • OpenAI
  • Azure
  • AWS
  • VertexAI
  • MistralAI
  • Cohere
npm i @langchain/openai
import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-large"
});
const vector1 = await embeddings.embedQuery(allSplits[0].pageContent);
const vector2 = await embeddings.embedQuery(allSplits[1].pageContent);

console.assert(vector1.length === vector2.length);
console.log(`Generated vectors of length ${vector1.length}\n`);
console.log(vector1.slice(0, 10));
Generated vectors of length 1536

[-0.008586574345827103, -0.03341241180896759, -0.008936782367527485, -0.0036674530711025, 0.010564599186182022, 0.009598285891115665, -0.028587326407432556, -0.015824200585484505, 0.0030416189692914486, -0.012899317778646946]
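The vector similarity metrics mentioned above can be computed directly on these arrays. Below is a minimal sketch of cosine similarity; the helper is hypothetical and written out only for illustration, since vector stores implement this internally:

// Cosine similarity: the dot product of the two vectors divided by the
// product of their magnitudes. Values close to 1 indicate similar meaning.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity(vector1, vector2));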
Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

3. Vector stores

LangChain VectorStore objects contain methods for adding text and Document objects to the store, and querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated to numeric vectors. LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (e.g., various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third-party; others can run in-memory for lightweight workloads. Let's select a vector store:
  • Memory
  • Chroma
  • FAISS
  • MongoDB
  • PGVector
  • Pinecone
  • Qdrant
npm i @langchain/classic
import { MemoryVectorStore } from "@langchain/classic/vectorstores/memory";

const vectorStore = new MemoryVectorStore(embeddings);
Having instantiated our vector store, we can now index the documents.
await vectorStore.addDocuments(allSplits);
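Most hosted integrations can also attach to an index that already contains documents, rather than creating a new one. A minimal sketch, assuming the @langchain/pinecone integration and a pre-existing index named "my-index" (both are assumptions for illustration and are not used elsewhere in this tutorial):

import { Pinecone } from "@pinecone-database/pinecone";
import { PineconeStore } from "@langchain/pinecone";

// Assumes PINECONE_API_KEY is set in the environment and that
// "my-index" (a hypothetical name) already exists.
const pinecone = new Pinecone();
const pineconeIndex = pinecone.Index("my-index");

// Wrap the existing index; nothing is re-indexed here.
const existingStore = await PineconeStore.fromExistingIndex(embeddings, {
  pineconeIndex,
});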
Note that most vector store implementations will allow you to connect to an existing vector store in this way, e.g., by providing a client, index name, or other information. See the documentation for a specific integration for more detail. Once we've instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:
  • Synchronously and asynchronously;
  • By string query and by vector;
  • With and without returning similarity scores;
  • By similarity and maximum marginal relevance (to balance similarity to the query with diversity in the retrieved results).
The methods will generally include a list of Document objects in their outputs.

Usage

Embeddings typically represent text as a "dense" vector such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific key-terms used in the document. Return documents based on similarity to a string query:
const results1 = await vectorStore.similaritySearch(
  "When was Nike incorporated?"
);

console.log(results1[0]);
Document {
    pageContent: 'direct to consumer operations sell products...',
    metadata: {'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125}
}
Return scores:
const results2 = await vectorStore.similaritySearchWithScore(
  "What was Nike's revenue in 2023?"
);

console.log(results2[0]);
Score: 0.23699893057346344

Document {
    pageContent: 'Table of Contents...',
    metadata: {'page': 35, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0}
}
Return documents based on similarity to an embedded query:
const embedding = await embeddings.embedQuery(
  "How were Nike's margins impacted in 2023?"
);

const results3 = await vectorStore.similaritySearchVectorWithScore(
  embedding,
  1
);

console.log(results3[0]);
Document {
    pageContent: 'FISCAL 2023 COMPARED TO FISCAL 2022...',
    metadata: {
        'page': 36,
        'source': '../example_data/nke-10k-2023.pdf',
        'start_index': 0
    }
}

4. Retrievers

LangChain VectorStore objects do not subclass Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (e.g., synchronous and asynchronous invoke and batch operations). Although we can construct retrievers from vector stores, retrievers can interface with non-vector store sources of data as well, such as external APIs (see the custom retriever sketch after the example below). Vector stores implement an asRetriever method that will generate a retriever, specifically a VectorStoreRetriever. These retrievers include specific searchType and searchKwargs attributes that identify what methods of the underlying vector store to call, and how to parameterize them. For instance, we can replicate the above with the following:
const retriever = vectorStore.asRetriever({
  searchType: "mmr",
  searchKwargs: {
    fetchK: 1,
  },
});

await retriever.batch([
  "When was Nike incorporated?",
  "What was Nike's revenue in 2023?",
]);
[
    [Document {
        metadata: {'page': 4, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 3125},
        pageContent: 'direct to consumer operations sell products...',
    }],
    [Document {
        metadata: {'page': 3, 'source': '../example_data/nke-10k-2023.pdf', 'start_index': 0},
        pageContent: 'Table of Contents...',
    }],
]
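Because retrievers are ordinary Runnables, they can also wrap data sources other than vector stores. Below is a minimal sketch of a custom retriever extending BaseRetriever; the keyword-scoring logic is a toy stand-in for an external API or database call and is not part of this tutorial:

import { BaseRetriever } from "@langchain/core/retrievers";
import { Document } from "@langchain/core/documents";

// A toy retriever that ranks documents by how many query terms they contain.
class KeywordRetriever extends BaseRetriever {
  lc_namespace = ["custom", "retrievers"];

  constructor(private docs: Document[]) {
    super();
  }

  async _getRelevantDocuments(query: string): Promise<Document[]> {
    const terms = query.toLowerCase().split(/\s+/);
    return this.docs
      .map((doc) => ({
        doc,
        score: terms.filter((term) =>
          doc.pageContent.toLowerCase().includes(term)
        ).length,
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, 4)
      .map(({ doc }) => doc);
  }
}

const keywordRetriever = new KeywordRetriever(allSplits);
console.log(await keywordRetriever.invoke("When was Nike incorporated?"));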
Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. To learn more about building such an application, check out the RAG tutorial. A minimal sketch follows below.
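The sketch stuffs the chunks returned by the retriever above into a prompt and calls a chat model. The model name and prompt wording are illustrative assumptions, not prescribed by this tutorial:

import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";

// Any chat model supported by LangChain works here; "gpt-4o-mini" is an assumption.
const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

const prompt = ChatPromptTemplate.fromTemplate(
  "Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
);

const question = "What was Nike's revenue in fiscal 2023?";

// Retrieve relevant chunks and join them into a single context string.
const retrievedDocs = await retriever.invoke(question);
const context = retrievedDocs.map((doc) => doc.pageContent).join("\n\n");

const response = await llm.invoke(await prompt.invoke({ context, question }));
console.log(response.content);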

Next steps

You've now learned how to build a semantic search engine over a PDF document. To go deeper, see the conceptual guides and how-to pages on document loaders, embeddings, vector stores, and RAG.