For large evaluation jobs in Python, we recommend using aevaluate(), the asynchronous version of evaluate(). It is still worth reading this guide before the how-to guide on running evaluations asynchronously, since the two share the same interface. In JS/TS, evaluate() is already asynchronous, so no separate method is needed. When running large jobs, it is also important to configure the max_concurrency/maxConcurrency argument, which parallelizes evaluation by effectively splitting the dataset across threads.

Define the application

First we need an application to evaluate. Let's create a simple toxicity classifier for this example.
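A minimal sketch of such a classifier (assuming an OpenAI client is available; the model name, prompt, and the `inputs["text"]` / `"class"` keys are illustrative choices, not requirements):

```python
from langsmith import traceable
from openai import OpenAI

oai_client = OpenAI()

@traceable  # optional: traces each call so it can be inspected alongside evaluation results
def toxicity_classifier(inputs: dict) -> dict:
    # Ask the model for a one-word verdict; the prompt and model are placeholders.
    instructions = (
        "Review the user text below and decide whether it contains toxic behavior "
        "such as insults, threats, or highly negative comments. "
        "Respond with 'Toxic' or 'Not toxic'."
    )
    result = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": inputs["text"]},
        ],
    )
    return {"class": result.choices[0].message.content}
```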
Create or select a dataset

We need a Dataset to evaluate our application on. Our dataset will contain labeled examples of toxic and non-toxic text. Requires langsmith>=0.3.13
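A minimal sketch of creating such a dataset with the SDK (the dataset name and example texts are illustrative; the `examples=` form of `create_examples` is what calls for langsmith>=0.3.13):

```python
from langsmith import Client

ls_client = Client()

# A handful of labeled examples: each pairs an input text with the expected class.
examples = [
    {"inputs": {"text": "Shut up, idiot"}, "outputs": {"label": "Toxic"}},
    {"inputs": {"text": "You're a wonderful person"}, "outputs": {"label": "Not toxic"}},
    {"inputs": {"text": "This is the worst thing ever"}, "outputs": {"label": "Toxic"}},
    {"inputs": {"text": "I had a great day today"}, "outputs": {"label": "Not toxic"}},
]

dataset = ls_client.create_dataset(dataset_name="Toxic Queries")
ls_client.create_examples(dataset_id=dataset.id, examples=examples)
```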
Define an evaluator
You can also check out LangChain’s open source evaluation package openevals for common pre-built evaluators.
- Python: Requires langsmith>=0.3.13
- TypeScript: Requires langsmith>=0.2.9
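For this example, a simple exact-match evaluator is enough. A sketch, assuming the output and reference key names used in the snippets above (returning a boolean records a binary feedback score):

```python
def correct(outputs: dict, reference_outputs: dict) -> bool:
    # Compare the app's predicted class against the labeled answer on the example.
    return outputs["class"] == reference_outputs["label"]
```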
Run the evaluation
We'll use the evaluate() / aevaluate() methods to run the evaluation. The key arguments are:

- a target function that takes an input dictionary and returns an output dictionary. The example.inputs field of each Example is what gets passed to the target function. In this case our toxicity_classifier is already set up to take in example inputs, so we can use it directly.
- data - the name OR UUID of the LangSmith dataset to evaluate on, or an iterator of examples
- evaluators - a list of evaluators to score the outputs of the function

Requires langsmith>=0.3.13
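A minimal invocation, reusing the `toxicity_classifier`, `dataset`, and `correct` names from the sketches above (the `experiment_prefix` and `max_concurrency` values are illustrative):

```python
from langsmith import evaluate  # or aevaluate for the async version

results = evaluate(
    toxicity_classifier,                   # target function
    data=dataset.name,                     # dataset name or UUID, or an iterator of examples
    evaluators=[correct],                  # list of evaluators to score outputs
    experiment_prefix="toxic-classifier",  # optional prefix for the experiment name
    max_concurrency=4,                     # parallelize across the dataset
)
```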
Explore the results
Each invocation of evaluate() creates an Experiment which can be viewed in the LangSmith UI or queried via the SDK. Evaluation scores are stored against each actual output as feedback.
If you’ve annotated your code for tracing, you can open the trace of each row in a side panel view.
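As one illustration, a sketch of pulling the experiment's runs and feedback back out via the SDK (this assumes the `results` object from the evaluate() call above exposes the experiment name; adjust to however you track the experiment):

```python
from langsmith import Client

client = Client()

# Each experiment is stored as a tracing project named after the experiment.
for run in client.list_runs(project_name=results.experiment_name, is_root=True):
    for feedback in client.list_feedback(run_ids=[run.id]):
        print(run.inputs, run.outputs, feedback.key, feedback.score)
```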
Reference code
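A consolidated sketch of the full example, under the same assumptions as the snippets above (an OpenAI client is available; names, prompt, and model are illustrative):

```python
from langsmith import Client, evaluate, traceable
from openai import OpenAI

ls_client = Client()
oai_client = OpenAI()

# 1. Create a small labeled dataset of toxic / non-toxic text.
examples = [
    {"inputs": {"text": "Shut up, idiot"}, "outputs": {"label": "Toxic"}},
    {"inputs": {"text": "You're a wonderful person"}, "outputs": {"label": "Not toxic"}},
]
dataset = ls_client.create_dataset(dataset_name="Toxic Queries")
ls_client.create_examples(dataset_id=dataset.id, examples=examples)

# 2. The application under test: a simple LLM-based toxicity classifier.
@traceable
def toxicity_classifier(inputs: dict) -> dict:
    instructions = (
        "Review the user text below and decide whether it contains toxic behavior "
        "such as insults, threats, or highly negative comments. "
        "Respond with 'Toxic' or 'Not toxic'."
    )
    result = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": inputs["text"]},
        ],
    )
    return {"class": result.choices[0].message.content}

# 3. An exact-match evaluator comparing the prediction to the labeled answer.
def correct(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["class"] == reference_outputs["label"]

# 4. Run the evaluation; each call creates a new Experiment in LangSmith.
results = evaluate(
    toxicity_classifier,
    data=dataset.name,
    evaluators=[correct],
    experiment_prefix="toxic-classifier",
    max_concurrency=4,
)
```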
Related
- Run an evaluation asynchronously
- Run an evaluation via the REST API
- Run an evaluation from the prompt playground