- Download your Messenger data to disk.
- Create the chat loader and call loader.load() (or loader.lazy_load()) to perform the conversion.
- Optional: use merge_chat_runs to combine consecutive messages from the same sender and/or map_ai_messages to convert messages from a specified sender to the "AIMessage" class. Then call convert_messages_for_finetuning to prepare your data for fine-tuning.
- Upload the messages to OpenAI and run the fine-tuning job.
- Use the resulting model in your LangChain application!
1. Download the data
To download your own Messenger data, follow the instructions here. Important: be sure to select the JSON format (not HTML). We are hosting example data at this Google Drive link, which we will use in this walkthrough.
# This uses some example data
import zipfile

import requests


def download_and_unzip(url: str, output_path: str = "file.zip") -> None:
    file_id = url.split("/")[-2]
    download_url = f"https://drive.google.com/uc?export=download&id={file_id}"
    response = requests.get(download_url)
    if response.status_code != 200:
        print("Failed to download the file.")
        return

    with open(output_path, "wb") as file:
        file.write(response.content)
        print(f"File {output_path} downloaded.")

    with zipfile.ZipFile(output_path, "r") as zip_ref:
        zip_ref.extractall()
        print(f"File {output_path} has been unzipped.")


# URL of the file to download
url = (
    "https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing"
)

# Download and unzip
download_and_unzip(url)
File file.zip downloaded.
File file.zip has been unzipped.
2. Create the Chat Loader
We provide two FacebookMessengerChatLoader classes: one for an entire directory of chats, and one for individual files.
directory_path = "./hogwarts"
from langchain_community.chat_loaders.facebook_messenger import (
    FolderFacebookMessengerChatLoader,
    SingleFileFacebookMessengerChatLoader,
)
loader = SingleFileFacebookMessengerChatLoader(
    path="./hogwarts/inbox/HermioneGranger/messages_Hermione_Granger.json",
)
chat_session = loader.load()[0]
chat_session["messages"][:3]
[HumanMessage(content="Hi Hermione! How's your summer going so far?", additional_kwargs={'sender': 'Harry Potter'}),
HumanMessage(content="Harry! Lovely to hear from you. My summer is going well, though I do miss everyone. I'm spending most of my time going through my books and researching fascinating new topics. How about you?", additional_kwargs={'sender': 'Hermione Granger'}),
HumanMessage(content="I miss you all too. The Dursleys are being their usual unpleasant selves but I'm getting by. At least I can practice some spells in my room without them knowing. Let me know if you find anything good in your researching!", additional_kwargs={'sender': 'Harry Potter'})]
loader = FolderFacebookMessengerChatLoader(
    path="./hogwarts",
)
chat_sessions = loader.load()
len(chat_sessions)
9
3. Prepare for fine-tuning
Calling load() maps every chat message it can extract to a HumanMessage. Conversations with a chat bot usually follow a stricter turn-taking pattern than real conversations, so you can optionally merge message "runs" (consecutive messages from the same sender) and designate one sender as the "AI". The fine-tuned LLM will learn to generate these AI messages.
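To see what the merging step does, here is a minimal standalone sketch, assuming each message is a plain dict with sender and content keys (the real merge_chat_runs operates on LangChain message objects; merge_runs is a hypothetical helper, not part of the library):

```python
from itertools import groupby


def merge_runs(messages):
    """Merge consecutive messages from the same sender into one message."""
    merged = []
    for sender, run in groupby(messages, key=lambda m: m["sender"]):
        merged.append(
            {"sender": sender, "content": "\n".join(m["content"] for m in run)}
        )
    return merged


chat = [
    {"sender": "Harry", "content": "Hi!"},
    {"sender": "Harry", "content": "Are you there?"},
    {"sender": "Hermione", "content": "Yes, hello!"},
]
print(merge_runs(chat))
# [{'sender': 'Harry', 'content': 'Hi!\nAre you there?'},
#  {'sender': 'Hermione', 'content': 'Yes, hello!'}]
```

After merging, each "run" becomes a single turn, which is what gives the alternating human/AI structure the fine-tuning format expects.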
from langchain_community.chat_loaders.utils import (
    map_ai_messages,
    merge_chat_runs,
)
merged_sessions = merge_chat_runs(chat_sessions)
alternating_sessions = list(map_ai_messages(merged_sessions, "Harry Potter"))
# Now all of Harry Potter's messages will take the AIMessage class
# which maps to the 'assistant' role in OpenAI's training format
alternating_sessions[0]["messages"][:3]
[AIMessage(content="Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.", additional_kwargs={'sender': 'Harry Potter'}),
HumanMessage(content="What is it, Potter? I'm quite busy at the moment.", additional_kwargs={'sender': 'Severus Snape'}),
AIMessage(content="I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.", additional_kwargs={'sender': 'Harry Potter'})]
Next, convert the messages to the dictionary format OpenAI expects.
from langchain_community.adapters.openai import convert_messages_for_finetuning
training_data = convert_messages_for_finetuning(alternating_sessions)
print(f"Prepared {len(training_data)} dialogues for training")
Prepared 9 dialogues for training
training_data[0][:3]
[{'role': 'assistant',
'content': "Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately."},
{'role': 'user',
'content': "What is it, Potter? I'm quite busy at the moment."},
{'role': 'assistant',
'content': "I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister."}]
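Before uploading, it can be worth sanity-checking each example against the basic shape of OpenAI's chat fine-tuning format: a non-empty list of dicts whose role is system, user, or assistant, with string content, and at least one assistant turn to learn from. A minimal validator sketch (validate_example is a hypothetical helper, not part of the libraries used here):

```python
def validate_example(messages):
    """Check one training example against the basic chat-format rules."""
    allowed_roles = {"system", "user", "assistant"}
    if not messages:
        return False
    for m in messages:
        if m.get("role") not in allowed_roles:
            return False
        if not isinstance(m.get("content"), str):
            return False
    # At least one assistant message is needed for the model to learn from
    return any(m["role"] == "assistant" for m in messages)


example = [
    {"role": "user", "content": "What is it, Potter?"},
    {"role": "assistant", "content": "I apologize for the interruption, sir."},
]
print(validate_example(example))  # → True
```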
# Our chat is alternating, we will make each datapoint a group of 8 messages,
# with 2 messages overlapping
chunk_size = 8
overlap = 2

training_examples = [
    conversation_messages[i : i + chunk_size]
    for conversation_messages in training_data
    for i in range(0, len(conversation_messages) - chunk_size + 1, chunk_size - overlap)
]

len(training_examples)
100
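The window arithmetic above can be checked on a toy conversation: the window advances by chunk_size - overlap = 6 messages at a time, and stops once a full chunk no longer fits. A small sketch (window is a hypothetical helper mirroring the comprehension above):

```python
def window(messages, chunk_size=8, overlap=2):
    """Slice a conversation into overlapping fixed-size chunks."""
    step = chunk_size - overlap
    return [
        messages[i : i + chunk_size]
        for i in range(0, len(messages) - chunk_size + 1, step)
    ]


toy = list(range(20))  # a 20-message conversation
chunks = window(toy)
print(len(chunks))  # 3 windows, starting at indices 0, 6, and 12
print(chunks[1][:2])  # [6, 7] — the last two messages of the first window
```

Each adjacent pair of windows shares exactly `overlap` messages, which gives the model some context continuity between training examples.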
4. Fine-tune the model
Now it's time to fine-tune the model. Make sure openai is installed and your OPENAI_API_KEY is set properly.
pip install -qU langchain-openai
import json
import time
from io import BytesIO

import openai

# We will write the jsonl file in memory
my_file = BytesIO()
for m in training_examples:
    my_file.write((json.dumps({"messages": m}) + "\n").encode("utf-8"))

my_file.seek(0)
training_file = openai.files.create(file=my_file, purpose="fine-tune")

# OpenAI audits each training file for compliance reasons.
# This may take a few minutes
status = openai.files.retrieve(training_file.id).status
start_time = time.time()
while status != "processed":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    status = openai.files.retrieve(training_file.id).status
print(f"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.")
File file-ULumAXLEFw3vB6bb9uy6DNVC ready after 0.00 seconds.
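The polling loops in this section wait indefinitely; in a script you may want a timeout guard. A sketch under the assumption that check() is any zero-argument status function (poll_until is a hypothetical helper, not part of the openai library):

```python
import time


def poll_until(check, target, interval=5.0, timeout=600.0):
    """Poll check() until it returns target, or raise after timeout seconds."""
    start = time.time()
    status = check()
    while status != target:
        if time.time() - start > timeout:
            raise TimeoutError(f"still [{status}] after {timeout}s")
        time.sleep(interval)
        status = check()
    return status


# Toy usage: a status source that reaches "processed" on the third call
calls = iter(["pending", "running", "processed"])
print(poll_until(lambda: next(calls), "processed", interval=0.01))  # → processed
```

For the real file check, you would pass something like `lambda: openai.files.retrieve(training_file.id).status` as `check`.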
job = openai.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
status = openai.fine_tuning.jobs.retrieve(job.id).status
start_time = time.time()
while status != "succeeded":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    job = openai.fine_tuning.jobs.retrieve(job.id)
    status = job.status
Status=[running]... 874.29s
print(job.fine_tuned_model)
ft:gpt-3.5-turbo-0613:personal::8QnAzWMr
5. Use in LangChain
You can pass the resulting model ID directly to the ChatOpenAI model class.
from langchain_openai import ChatOpenAI

model = ChatOpenAI(
    model=job.fine_tuned_model,
    temperature=1,
)
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
    ]
)

chain = prompt | model | StrOutputParser()
for tok in chain.stream({"input": "What classes are you taking?"}):
    print(tok, end="", flush=True)
I'm taking Charms, Defense Against the Dark Arts, Herbology, Potions, Transfiguration, and Ancient Runes. How about you?