[Docs] Update evaluation doc (#39)

2025-05-30 16:03:24 +08:00 · 2023-07-17 14:12:19 +08:00 · 2023-07-17 14:12:19 +08:00 · 77a1cc4486
commit 77a1cc4486
parent e19a0c1cf8
2 changed files with 103 additions and 15 deletions
--- a/docs/en/user_guides/evaluation.md
+++ b/docs/en/user_guides/evaluation.md
@ -2,7 +2,36 @@

 OpenCompass supports custom task partitioners (`Partitioner`), which enable flexible division of evaluation tasks. In conjunction with `Runner`, which controls the platform for task execution, such as a local machine or a cluster, OpenCompass can distribute large evaluation tasks to a vast number of computing nodes. This helps utilize computational resources efficiently and significantly accelerates the evaluation process.

-## Task Division (Partitioner)
+By default, OpenCompass hides these details from users and automatically selects the recommended execution strategies. Users can still customize these strategies to suit their needs by adding the `infer` and/or `eval` fields to the configuration file:
+
+```python
+from opencompass.partitioners import SizePartitioner, NaivePartitioner
+from opencompass.runners import LocalRunner, SlurmRunner
+from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
+
+infer = dict(
+    partitioner=dict(type=SizePartitioner, max_task_size=5000),
+    runner=dict(
+        type=SlurmRunner,
+        max_num_workers=64,
+        task=dict(type=OpenICLInferTask),
+        retry=5),
+)
+
+eval = dict(
+    partitioner=dict(type=NaivePartitioner),
+    runner=dict(
+        type=LocalRunner,
+        max_num_workers=32,
+        task=dict(type=OpenICLEvalTask)),
+)
+```
+
+The example above shows how to configure the execution strategies for the inference and evaluation stages. In the inference stage, the task is divided into several sub-tasks of 5000 samples each, which are then submitted to the Slurm cluster, where at most 64 tasks run in parallel. In the evaluation stage, each model-dataset pair forms a task, and 32 processes are launched locally to compute the metrics.
+
+The following sections introduce the modules involved in detail.
+
+## Task Partition (Partitioner)

 Due to the long inference time of large language models and the vast amount of evaluation datasets, serial execution of a single evaluation task can be quite time-consuming. OpenCompass allows custom task partitioners (`Partitioner`) to divide large evaluation tasks into numerous independent smaller tasks, thus fully utilizing computational resources via parallel execution. Users can configure the task partitioning strategies for the inference and evaluation stages via `infer.partitioner` and `eval.partitioner`. Below, we will introduce all the partitioning strategies supported by OpenCompass.

@ -14,7 +43,7 @@ This partitioner dispatches each combination of a model and dataset as an indepe
 from opencompass.partitioners import NaivePartitioner

 infer = dict(
-    partitioner=dict(type=SizePartitioner)
+    partitioner=dict(type=NaivePartitioner)
    # ...
 )
 ```
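
As a rough illustration of the idea (not OpenCompass's actual implementation), naive partitioning amounts to pairing every model with every dataset. The `naive_partition` helper and the plain-string model and dataset names below are hypothetical:

```python
from itertools import product


def naive_partition(models, datasets):
    """Return one task per (model, dataset) combination."""
    return [dict(model=m, dataset=d) for m, d in product(models, datasets)]


tasks = naive_partition(['model_a', 'model_b'],
                        ['dataset_x', 'dataset_y', 'dataset_z'])
print(len(tasks))  # 2 models x 3 datasets = 6 tasks
```

The number of tasks grows with the product of the two lists, and a single large dataset still ends up in one task, which is the case size-based partitioning is meant to address.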
@ -22,10 +51,10 @@ infer = dict(
 ### `SizePartitioner`

 ```{warning}
-This partitioner is not suitable for evaluation stage tasks (OpenEvalTask).
+For now, this partitioner is not suitable for evaluation tasks (`OpenICLEvalTask`).
 ```

-This partitioner estimates the inference cost (time) of a dataset according to its size, multiplied by an expansion coefficient. It then creates tasks by splitting larger datasets and merging smaller ones to ensure the inference costs of each sub-task are as equal as possible.
+This partitioner estimates the inference cost (time) of a dataset according to its size, multiplied by an empirical expansion coefficient. It then creates tasks by splitting larger datasets and merging smaller ones to ensure the inference costs of each sub-task are as equal as possible.
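
The split-and-merge idea can be sketched in a few lines. This is a simplified model under stated assumptions, not the library's code: the `size_partition` function, the `(num_samples, is_generative)` input format, and the coefficient of 1 for non-generative datasets are all hypothetical.

```python
def size_partition(datasets, max_task_size=2000, gen_task_coef=20):
    """Split large datasets and merge small ones into tasks of bounded cost.

    `datasets` maps a dataset name to (num_samples, is_generative).
    Each task is a list of (name, start, end) sample ranges.
    """
    tasks, current, current_cost = [], [], 0
    for name, (num_samples, is_generative) in datasets.items():
        # Assumption: non-generative datasets use a coefficient of 1.
        coef = gen_task_coef if is_generative else 1
        cost = num_samples * coef
        if cost > max_task_size:
            # Split: each chunk holds as many samples as fit into one task.
            step = max(1, max_task_size // coef)
            for start in range(0, num_samples, step):
                tasks.append([(name, start, min(start + step, num_samples))])
            continue
        if current and current_cost + cost > max_task_size:
            # The merge buffer is full: emit it and start a new one.
            tasks.append(current)
            current, current_cost = [], 0
        current.append((name, 0, num_samples))
        current_cost += cost
    if current:
        tasks.append(current)
    return tasks


tasks = size_partition({
    'big_gen': (250, True),    # cost 250 * 20 = 5000, split into 3 chunks
    'small_a': (800, False),   # cost 800
    'small_b': (900, False),   # cost 900, merged with small_a (1700 total)
    'small_c': (500, False),   # cost 500, would exceed 2000, so a new task
})
print(len(tasks))  # 5
```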

 The commonly used parameters for this partitioner are as follows:

@ -35,7 +64,7 @@ from opencompass.partitioners import SizePartitioner
 infer = dict(
    partitioner=dict(
        type=SizePartitioner,
-        max_task_size: int = 2000,  # Maximum length of a single task
+        max_task_size=2000,  # Maximum size of each task
        gen_task_coef=20,  # Expansion coefficient for generative tasks
    ),
    # ...
@ -54,7 +83,7 @@ In a multi-card, multi-machine cluster environment, if we want to implement para

 ### `LocalRunner`

-`LocalRunner` is the most basic runner that can run tasks in serial on the local machine.
+`LocalRunner` is the most basic runner; it runs tasks in parallel on the local machine.

 ```python
 from opencompass.runners import LocalRunner
@ -64,12 +93,15 @@ infer = dict(
    # ...
    runner=dict(
        type=LocalRunner,
+        max_num_workers=16,  # Maximum number of processes to run in parallel
        task=dict(type=OpenICLEvalTask),  # Task to be run
    )
 )
 ```

-In the future, we plan to enhance the capabilities of `LocalRunner` to effectively utilize multi-card resources on a single machine.
+```{note}
+The actual number of running tasks is limited by both the available GPU resources and the number of workers.
+```
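
One way to read this note is as a simple upper bound. The helper below is a hypothetical back-of-the-envelope model, not how `LocalRunner` actually schedules tasks; `free_gpus` and `gpus_per_task` are assumed inputs.

```python
def effective_parallelism(max_num_workers, free_gpus, gpus_per_task):
    """Upper bound on concurrently running tasks under both limits."""
    if gpus_per_task == 0:
        # CPU-only tasks are limited by the worker count alone.
        return max_num_workers
    return min(max_num_workers, free_gpus // gpus_per_task)


print(effective_parallelism(max_num_workers=16, free_gpus=8, gpus_per_task=2))  # 4
print(effective_parallelism(max_num_workers=2, free_gpus=8, gpus_per_task=1))   # 2
```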

 ### `SlurmRunner`

@ -85,8 +117,6 @@ infer = dict(
        type=SlurmRunner,
        task=dict(type=OpenICLEvalTask),  # Task to be run
        max_num_workers=16,  # Maximum concurrent evaluation task count
-        partition='lm',  # The Slurm partition for running tasks
-        quotatype='auto',  # (Supported only in some Slurm, can be left unset) The priority for running tasks
        retry=2,  # Retry count for failed tasks, can prevent accidental errors
    ),
 )
@ -127,3 +157,17 @@ infer = dict(
    ),
 )
 ```
+
+## Task
+
+A Task is a fundamental module in OpenCompass: a standalone script that executes computationally intensive operations. Each task loads a configuration file to determine its parameter settings, and it can be executed in two distinct ways:
+
+1. Instantiate a Task object, then call `task.run()`.
+2. Call the `get_command` method, passing in the config path and a command template string that contains `{task_cmd}` as a placeholder (e.g. `srun {task_cmd}`). The returned string is the full command and can be executed directly.
+
+As of now, OpenCompass supports the following task types:
+
+- `OpenICLInferTask`: Performs LM inference based on the OpenICL framework.
+- `OpenICLEvalTask`: Performs LM evaluation based on the OpenEval framework.
+
+In the future, more task types will be supported.
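
The `{task_cmd}` placeholder mechanism in the second option can be illustrated with plain string formatting. This is only a sketch of the template idea: `build_command` and the task command string are hypothetical and are not part of the `get_command` API.

```python
def build_command(template, task_cmd):
    """Substitute the task command into a launcher template."""
    if '{task_cmd}' not in template:
        raise ValueError('template must contain the {task_cmd} placeholder')
    return template.format(task_cmd=task_cmd)


# Hypothetical task command; the real one is produced by the Task itself.
task_cmd = 'python run_task.py path/to/config.py'
print(build_command('srun {task_cmd}', task_cmd))
# srun python run_task.py path/to/config.py
print(build_command('{task_cmd}', task_cmd))
# python run_task.py path/to/config.py
```

Keeping the launcher in the template means the same task can run under Slurm or directly on the local machine without changing the task itself.
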
--- a/docs/zh_cn/user_guides/evaluation.md
+++ b/docs/zh_cn/user_guides/evaluation.md
@ -2,6 +2,35 @@

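 OpenCompass supports custom task partitioners (`Partitioner`) for evaluation tasks, which enable flexible task division, and works with `Runner` to control the platform on which tasks run, such as a local machine or a cluster. By combining the two, OpenCompass can distribute large evaluation tasks across a large number of computing nodes, using computational resources efficiently and greatly accelerating the evaluation process.

+By default, OpenCompass hides these details from users and automatically selects the recommended execution strategies. However, users can still customize these strategies to suit their needs by adding the `infer` and/or `eval` fields to the configuration file: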
+
+```python
+from opencompass.partitioners import SizePartitioner, NaivePartitioner
+from opencompass.runners import LocalRunner, SlurmRunner
+from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
+
+infer = dict(
+    partitioner=dict(type=SizePartitioner, max_task_size=5000),
+    runner=dict(
+        type=SlurmRunner,
+        max_num_workers=64,
+        task=dict(type=OpenICLInferTask),
+        retry=5),
+)
+
+eval = dict(
+    partitioner=dict(type=NaivePartitioner),
+    runner=dict(
+        type=LocalRunner,
+        max_num_workers=32,
+        task=dict(type=OpenICLEvalTask)),
+)
+```
+
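+The example above shows how to configure the execution strategies for the inference and evaluation stages. In the inference stage, the task is divided into several sub-tasks of 5000 samples each, which are then submitted to the Slurm cluster, where at most 64 tasks run in parallel. In the evaluation stage, each model-dataset pair forms a task, and 32 processes are launched locally to compute the metrics.
+
+The following sections introduce the modules involved in detail.
+
 ## Task Partition (Partitioner)

 Because large language models take a long time to run inference and the evaluation datasets are large, running an evaluation task serially is often very time-consuming.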
@ -15,7 +44,7 @@ OpenCompass supports using custom task partitioners (`Partitioner`
 from opencompass.partitioners import NaivePartitioner

 infer = dict(
-    partitioner=dict(type=SizePartitioner)
+    partitioner=dict(type=NaivePartitioner)
    # ...
 )
 ```
@ -23,7 +52,7 @@ infer = dict(
 ### `SizePartitioner`

 ```{warning}
-This partitioner is not suitable for evaluation-stage tasks (OpenEvalTask).
+For now, this partitioner is not suitable for evaluation-stage tasks (`OpenICLEvalTask`).
 ```

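 This partitioner estimates the inference cost (time) of a dataset from its size multiplied by an expansion coefficient. It then creates tasks by splitting large datasets and merging small ones, keeping the inference costs of the sub-tasks as equal as possible.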
@ -55,7 +84,7 @@ infer = dict(

 ### `LocalRunner`

-`LocalRunner` is the most basic runner; it runs tasks serially on the local machine.
+`LocalRunner` is the most basic runner; it runs tasks in parallel on the local machine.

 ```python
 from opencompass.runners import LocalRunner
@ -65,12 +94,15 @@ infer = dict(
    # ...
    runner=dict(
        type=LocalRunner,
+        max_num_workers=16,  # Maximum number of processes to run in parallel
        task=dict(type=OpenICLEvalTask),  # Task to be run
    )
 )
 ```

-In the future, we plan to enhance `LocalRunner` to make efficient use of multi-card resources on a single machine.
+```{note}
+The actual number of running tasks is limited by both the available GPU resources and `max_num_workers`.
+```

 ### `SlurmRunner`

@ -86,8 +118,6 @@ infer = dict(
        type=SlurmRunner,
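        task=dict(type=OpenICLEvalTask),  # Task to be run
        max_num_workers=16,  # Maximum number of concurrent evaluation tasks
-        partition='lm',  # The Slurm partition for running tasks
-        quotatype='auto',  # (Supported only in some Slurm setups, can be left unset) The priority for running tasks
        retry=2,  # Retry count for failed tasks; guards against occasional errors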
    ),
 )
@ -129,3 +159,17 @@ infer = dict(
 )

 ```
+
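+## Task
+
+A Task is a fundamental module in OpenCompass: a standalone script that executes computationally intensive operations. Each task determines its parameter settings from a configuration file and can be executed in two different ways:
+
+1. Instantiate a Task object, then call `task.run()`.
+2. Call the `get_command` method, passing in the config path and a command template string that contains the `{task_cmd}` placeholder (e.g. `srun {task_cmd}`). The returned string is the full command and can be executed directly.
+
+Currently, OpenCompass supports the following task types:
+
+- `OpenICLInferTask`: Performs language model (LM) inference based on the OpenICL framework.
+- `OpenICLEvalTask`: Performs language model (LM) evaluation based on the OpenEval framework.
+
+In the future, OpenCompass will support more task types.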