- Download your Messenger data to disk.
- Create the chat loader and call loader.load() (or loader.lazy_load()) to perform the conversion.
- Optional: use merge_chat_runs to combine consecutive messages from the same sender and/or map_ai_messages to convert messages from a specified sender to the "AIMessage" class. Then call convert_messages_for_finetuning to prepare your data for fine-tuning.
- Upload the messages to OpenAI and run the fine-tuning job.
- Use the resulting model in your LangChain application!
1. Download the data
To download your own Messenger data, follow the instructions here. Important: be sure to select the JSON format (not HTML). We are hosting example data at this Google Drive link, which we will use in this walkthrough.
# This uses some example data
import zipfile

import requests


def download_and_unzip(url: str, output_path: str = "file.zip") -> None:
    file_id = url.split("/")[-2]
    download_url = f"https://drive.google.com/uc?export=download&id={file_id}"
    response = requests.get(download_url)
    if response.status_code != 200:
        print("Failed to download the file.")
        return

    with open(output_path, "wb") as file:
        file.write(response.content)
        print(f"File {output_path} downloaded.")

    with zipfile.ZipFile(output_path, "r") as zip_ref:
        zip_ref.extractall()
        print(f"File {output_path} has been unzipped.")


# URL of the file to download
url = (
    "https://drive.google.com/file/d/1rh1s1o2i7B-Sk1v9o8KNgivLVGwJ-osV/view?usp=sharing"
)

# Download and unzip
download_and_unzip(url)
File file.zip downloaded.
File file.zip has been unzipped.
2. Create the Chat Loader
We provide two FacebookMessengerChatLoader classes: one for an entire directory of chats, and one for individual files.
directory_path = "./hogwarts"
from langchain_community.chat_loaders.facebook_messenger import (
    FolderFacebookMessengerChatLoader,
    SingleFileFacebookMessengerChatLoader,
)
loader = SingleFileFacebookMessengerChatLoader(
    path="./hogwarts/inbox/HermioneGranger/messages_Hermione_Granger.json",
)
chat_session = loader.load()[0]
chat_session["messages"][:3]
[HumanMessage(content="Hi Hermione! How's your summer going so far?", additional_kwargs={'sender': 'Harry Potter'}),
HumanMessage(content="Harry! Lovely to hear from you. My summer is going well, though I do miss everyone. I'm spending most of my time going through my books and researching fascinating new topics. How about you?", additional_kwargs={'sender': 'Hermione Granger'}),
HumanMessage(content="I miss you all too. The Dursleys are being their usual unpleasant selves but I'm getting by. At least I can practice some spells in my room without them knowing. Let me know if you find anything good in your researching!", additional_kwargs={'sender': 'Harry Potter'})]
loader = FolderFacebookMessengerChatLoader(
    path="./hogwarts",
)
chat_sessions = loader.load()
len(chat_sessions)
9
3. Prepare for fine-tuning
Calling load() maps every chat message it can extract to a HumanMessage. Conversations with a chat bot usually follow a stricter turn-taking pattern than real conversations, so you can optionally merge message "runs" (consecutive messages from the same sender) and designate one sender as the "AI". The fine-tuned LLM will learn to generate these AI messages.
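To see what the merging step does, here is a minimal standalone sketch, assuming each message is a plain dict with sender and content keys (the real merge_chat_runs operates on LangChain message objects; merge_runs is a hypothetical helper, not part of the library):

```python
from itertools import groupby


def merge_runs(messages):
    """Merge consecutive messages from the same sender into one message."""
    merged = []
    for sender, run in groupby(messages, key=lambda m: m["sender"]):
        merged.append(
            {"sender": sender, "content": "\n".join(m["content"] for m in run)}
        )
    return merged


chat = [
    {"sender": "Harry", "content": "Hi!"},
    {"sender": "Harry", "content": "Are you there?"},
    {"sender": "Hermione", "content": "Yes, hello!"},
]
print(merge_runs(chat))
# [{'sender': 'Harry', 'content': 'Hi!\nAre you there?'},
#  {'sender': 'Hermione', 'content': 'Yes, hello!'}]
```

After merging, each "run" becomes a single turn, which is what gives the alternating human/AI structure the fine-tuning format expects.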
from langchain_community.chat_loaders.utils import (
    map_ai_messages,
    merge_chat_runs,
)
merged_sessions = merge_chat_runs(chat_sessions)
alternating_sessions = list(map_ai_messages(merged_sessions, "Harry Potter"))
# Now all of Harry Potter's messages will take the AIMessage class
# which maps to the 'assistant' role in OpenAI's training format
alternating_sessions[0]["messages"][:3]
[AIMessage(content="Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately.", additional_kwargs={'sender': 'Harry Potter'}),
HumanMessage(content="What is it, Potter? I'm quite busy at the moment.", additional_kwargs={'sender': 'Severus Snape'}),
AIMessage(content="I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister.", additional_kwargs={'sender': 'Harry Potter'})]
Next, convert the messages to the dictionary format OpenAI expects.
from langchain_community.adapters.openai import convert_messages_for_finetuning
training_data = convert_messages_for_finetuning(alternating_sessions)
print(f"Prepared {len(training_data)} dialogues for training")
Prepared 9 dialogues for training
training_data[0][:3]
[{'role': 'assistant',
'content': "Professor Snape, I was hoping I could speak with you for a moment about something that's been concerning me lately."},
{'role': 'user',
'content': "What is it, Potter? I'm quite busy at the moment."},
{'role': 'assistant',
'content': "I apologize for the interruption, sir. I'll be brief. I've noticed some strange activity around the school grounds at night. I saw a cloaked figure lurking near the Forbidden Forest last night. I'm worried someone may be plotting something sinister."}]
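Before uploading, it can be worth sanity-checking each example against the basic shape of OpenAI's chat fine-tuning format: a non-empty list of dicts whose role is system, user, or assistant, with string content, and at least one assistant turn to learn from. A minimal validator sketch (validate_example is a hypothetical helper, not part of the libraries used here):

```python
def validate_example(messages):
    """Check one training example against the basic chat-format rules."""
    allowed_roles = {"system", "user", "assistant"}
    if not messages:
        return False
    for m in messages:
        if m.get("role") not in allowed_roles:
            return False
        if not isinstance(m.get("content"), str):
            return False
    # At least one assistant message is needed for the model to learn from
    return any(m["role"] == "assistant" for m in messages)


example = [
    {"role": "user", "content": "What is it, Potter?"},
    {"role": "assistant", "content": "I apologize for the interruption, sir."},
]
print(validate_example(example))  # → True
```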
# Our chat is alternating, we will make each datapoint a group of 8 messages,
# with 2 messages overlapping
chunk_size = 8
overlap = 2

training_examples = [
    conversation_messages[i : i + chunk_size]
    for conversation_messages in training_data
    for i in range(0, len(conversation_messages) - chunk_size + 1, chunk_size - overlap)
]

len(training_examples)
100
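The window arithmetic above can be checked on a toy conversation: the window advances by chunk_size - overlap = 6 messages at a time, and stops once a full chunk no longer fits. A small sketch (window is a hypothetical helper mirroring the comprehension above):

```python
def window(messages, chunk_size=8, overlap=2):
    """Slice a conversation into overlapping fixed-size chunks."""
    step = chunk_size - overlap
    return [
        messages[i : i + chunk_size]
        for i in range(0, len(messages) - chunk_size + 1, step)
    ]


toy = list(range(20))  # a 20-message conversation
chunks = window(toy)
print(len(chunks))  # 3 windows, starting at indices 0, 6, and 12
print(chunks[1][:2])  # [6, 7] — the last two messages of the first window
```

Each adjacent pair of windows shares exactly `overlap` messages, which gives the model some context continuity between training examples.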
4. Fine-tune the model
Now it's time to fine-tune the model. Make sure openai is installed and your OPENAI_API_KEY is set properly.
pip install -qU langchain-openai
import json
import time
from io import BytesIO

import openai

# We will write the jsonl file in memory
my_file = BytesIO()
for m in training_examples:
    my_file.write((json.dumps({"messages": m}) + "\n").encode("utf-8"))

my_file.seek(0)
training_file = openai.files.create(file=my_file, purpose="fine-tune")

# OpenAI audits each training file for compliance reasons.
# This may take a few minutes
status = openai.files.retrieve(training_file.id).status
start_time = time.time()
while status != "processed":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    status = openai.files.retrieve(training_file.id).status
print(f"File {training_file.id} ready after {time.time() - start_time:.2f} seconds.")
File file-ULumAXLEFw3vB6bb9uy6DNVC ready after 0.00 seconds.
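The polling loops in this section wait indefinitely; in a script you may want a timeout guard. A sketch under the assumption that check() is any zero-argument status function (poll_until is a hypothetical helper, not part of the openai library):

```python
import time


def poll_until(check, target, interval=5.0, timeout=600.0):
    """Poll check() until it returns target, or raise after timeout seconds."""
    start = time.time()
    status = check()
    while status != target:
        if time.time() - start > timeout:
            raise TimeoutError(f"still [{status}] after {timeout}s")
        time.sleep(interval)
        status = check()
    return status


# Toy usage: a status source that reaches "processed" on the third call
calls = iter(["pending", "running", "processed"])
print(poll_until(lambda: next(calls), "processed", interval=0.01))  # → processed
```

For the real file check, you would pass something like `lambda: openai.files.retrieve(training_file.id).status` as `check`.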
job = openai.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
status = openai.fine_tuning.jobs.retrieve(job.id).status
start_time = time.time()
while status != "succeeded":
    print(f"Status=[{status}]... {time.time() - start_time:.2f}s", end="\r", flush=True)
    time.sleep(5)
    job = openai.fine_tuning.jobs.retrieve(job.id)
    status = job.status
Status=[running]... 874.29s
print(job.fine_tuned_model)
ft:gpt-3.5-turbo-0613:personal::8QnAzWMr
5. Use in LangChain
You can pass the resulting model ID directly to the ChatOpenAI model class.
from langchain_openai import ChatOpenAI

model = ChatOpenAI(
    model=job.fine_tuned_model,
    temperature=1,
)
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
    ]
)

chain = prompt | model | StrOutputParser()
for tok in chain.stream({"input": "What classes are you taking?"}):
    print(tok, end="", flush=True)
I'm taking Charms, Defense Against the Dark Arts, Herbology, Potions, Transfiguration, and Ancient Runes. How about you?