Split by token

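Language models have a token limit, and a request must not exceed it. When you split text into chunks, it therefore helps to count tokens rather than characters. Many tokenizers exist, and their counts differ, so count tokens with the same tokenizer that the language model uses.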

js-tiktoken

js-tiktoken is a JavaScript version of the BPE tokenizer created by OpenAI.

We can use tiktoken through TokenTextSplitter to estimate the number of tokens used. The estimate will probably be more accurate for OpenAI models.

How the text is split: by the character passed in.
How the chunk size is measured: by the tiktoken tokenizer.
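The two points above (split on a character, measure size in tokens) can be pictured with the following minimal sketch. It is not the library's implementation: `splitAndMerge` and `countWords` are hypothetical names, and `countWords` is a stand-in that counts whitespace-separated words where a real setup would count the token ids produced by the tokenizer.

```typescript
// Hypothetical stand-in for a real tokenizer: counts whitespace-separated words.
// In practice this would return the number of token ids from the model's tokenizer.
type TokenCounter = (text: string) => number;

const countWords: TokenCounter = (text) =>
  text.split(/\s+/).filter((w) => w.length > 0).length;

// Split on a separator, then greedily merge neighbouring pieces while the
// merged chunk stays within chunkSize tokens.
function splitAndMerge(
  text: string,
  separator: string,
  chunkSize: number,
  countTokens: TokenCounter,
): string[] {
  const pieces = text.split(separator).filter((p) => p.trim().length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (current !== "" && countTokens(candidate) > chunkSize) {
      // Adding this piece would exceed the budget: close the current chunk.
      chunks.push(current);
      current = piece;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

const sample = "one two three\n\nfour five\n\nsix seven eight nine ten eleven";
const merged = splitAndMerge(sample, "\n\n", 5, countWords);
console.log(merged);
```

In this sketch the last piece contains six words, so it ends up as a chunk larger than the requested size of five. A single piece is never cut in the middle, which is one way a separator-based split can overshoot the measured chunk size.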

npm install @langchain/textsplitters

import { TokenTextSplitter } from "@langchain/textsplitters";
import { readFileSync } from "fs";

// Example: read a long document
const stateOfTheUnion = readFileSync("state_of_the_union.txt", "utf8");

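To split with TokenTextSplitter and then merge the chunks with tiktoken, pass an encodingName (for example cl100k_base) when initializing TokenTextSplitter. Note that splits produced this way can be larger than the chunk size measured by the tiktoken tokenizer.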

import { TokenTextSplitter } from "@langchain/textsplitters";

// Example: use the cl100k_base encoding
const splitter = new TokenTextSplitter({ encodingName: "cl100k_base", chunkSize: 10, chunkOverlap: 0 });

// splitText is async and resolves to an array of strings, so await it
const texts = await splitter.splitText(stateOfTheUnion);
console.log(texts[0]);

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.

Last year COVID-19 kept us apart. This year we are finally together again.

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.

With a duty to one another to the American people to the Constitution.
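One way to picture the chunkSize and chunkOverlap options is as a sliding window over a list of tokens. The sketch below is an illustration under stated assumptions, not the library's code: `windowTokens` is a hypothetical helper, and plain words stand in for the BPE token ids a real tokenizer would produce.

```typescript
// Slide a window of chunkSize tokens over the token list, stepping by
// chunkSize - chunkOverlap so consecutive windows share chunkOverlap tokens.
function windowTokens<T>(tokens: T[], chunkSize: number, chunkOverlap: number): T[][] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const windows: T[][] = [];
  for (let start = 0; start < tokens.length; start += step) {
    windows.push(tokens.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the input.
    if (start + chunkSize >= tokens.length) break;
  }
  return windows;
}

// Hypothetical stand-in tokens: words instead of BPE token ids.
const words =
  "Last year COVID-19 kept us apart. This year we are finally together again.".split(" ");
const windows = windowTokens(words, 5, 2).map((w) => w.join(" "));
console.log(windows);
```

With a chunk size of 5 and an overlap of 2, each window starts three tokens after the previous one, so the last two tokens of one window reappear at the start of the next. Setting the overlap to 0 makes the windows disjoint.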
