[Doc] Add summarizer doc (#231)
* add summarizer doc
* update
* update doc
* Apply suggestions from code review

Co-authored-by: Tong Gao <gaotongxiao@gmail.com>

parent a85634a32a
commit c0e58632ca

@@ -245,6 +245,10 @@ outputs/default/
├── ...
```

The summarization process can be further customized in the configuration to output the averaged scores of some benchmarks (MMLU, C-Eval, etc.).

More information about obtaining evaluation results can be found in [Results Summary](./user_guides/summarizer.md).

## Additional Tutorials

To learn more about using OpenCompass, explore the following tutorials:

@@ -253,4 +257,5 @@ To learn more about using OpenCompass, explore the following tutorials:

- [Prepare Models](./user_guides/models.md)
- [Task Execution and Monitoring](./user_guides/experimentation.md)
- [Understand Prompts](./prompt/overview.md)
- [Results Summary](./user_guides/summarizer.md)
- [Learn about Config](./user_guides/config.md)

@@ -36,6 +36,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.

   user_guides/evaluation.md
   user_guides/experimentation.md
   user_guides/metrics.md
   user_guides/summarizer.md

.. _Prompt:

.. toctree::

docs/en/user_guides/summarizer.md (new file, 60 lines)

@@ -0,0 +1,60 @@

# Results Summary

After the evaluation is complete, the results need to be printed to the screen or saved. This process is controlled by the summarizer.

```{note}
If a summarizer appears in the overall config, all evaluation results will be output according to the logic below.
If no summarizer appears in the overall config, the evaluation results will be output in the order in which they appear in the `datasets` config.
```

## Example

A typical summarizer configuration file is as follows:

```python
summarizer = dict(
    dataset_abbrs=[
        'race',
        'race-high',
        'race-middle',
    ],
    summary_groups=[
        {'name': 'race', 'subsets': ['race-high', 'race-middle']},
    ]
)
```

The output is:

```text
dataset      version    metric         mode    internlm-7b-hf
-----------  ---------  -------------  ------  ----------------
race         -          naive_average  ppl     76.23
race-high    0c332f     accuracy       ppl     74.53
race-middle  0c332f     accuracy       ppl     77.92
```

Taking the `models` and `datasets` in the config as the full set, the summarizer tries to read the evaluation scores from the `{work_dir}/results/` directory and displays them in the order of the `summarizer.dataset_abbrs` list. It also tries to compute the aggregated metrics defined in `summarizer.summary_groups`. A `name` metric is generated if and only if all the values in `subsets` exist, which means that if some scores are missing, the aggregated metric will be missing as well. If a score cannot be fetched by either of these methods, the summarizer puts `-` in the corresponding cell of the table.

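As a quick sanity check, the aggregated `race` row in the table above can be reproduced by hand. A minimal sketch of the unweighted `naive_average` computation (the exact rounding used for display is assumed):

```python
# Recompute the aggregated 'race' row from the example table above.
# naive_average is the unweighted mean of the subset scores.
subset_scores = {'race-high': 74.53, 'race-middle': 77.92}

race_score = sum(subset_scores.values()) / len(subset_scores)
print(race_score)  # ~76.225; the table shows this as 76.23
```
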
In addition, the output consists of multiple columns:

- The `dataset` column corresponds to the `summarizer.dataset_abbrs` configuration.
- The `version` column is the hash of the dataset, which takes into account the dataset's evaluation method, prompts, output length limit, etc. Users can use this column to verify whether two evaluation results are comparable.
- The `metric` column indicates the evaluation method of the metric; see [metrics](./metrics.md) for details.
- The `mode` column indicates how the inference result was obtained. Possible values are `ppl` / `gen`. For items in `summarizer.summary_groups`, the value is the same as that of the `subsets` if they were all obtained in the same way, and `mixed` otherwise (see the sketch after this list).
- The subsequent columns each represent a model.

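For example, a summary group whose subsets were evaluated under different modes would be reported with `mixed` in its `mode` cell. A minimal sketch (the dataset names here are hypothetical):

```python
# Hypothetical: assume 'nli-task' was evaluated with ppl and 'qa-task' with gen.
# The aggregated 'mixed-demo' row would then show 'mixed' in its mode column.
summarizer = dict(
    summary_groups=[
        {'name': 'mixed-demo', 'subsets': ['nli-task', 'qa-task']},
    ]
)
```
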
## Field Description

The fields of the summarizer are explained as follows:

- `dataset_abbrs`: (list, optional) Items to display. If omitted, all evaluation results will be output.
- `summary_groups`: (list, optional) Configuration of the aggregated metrics.

The fields in `summary_groups` are:

- `name`: (str) Name of the aggregated metric.
- `subsets`: (list) Names of the metrics to aggregate. Note that an entry can be not only an original `dataset_abbr` but also the name of another aggregated metric.
- `weights`: (list, optional) Weights of the metrics being aggregated, as illustrated in the sketch after this list. If omitted, an unweighted average is used.

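For illustration, a weighted variant of the `race` group from the example above might look like the following. This sketch assumes weights are matched positionally to `subsets`, consistent with the list type documented above:

```python
summarizer = dict(
    summary_groups=[
        # Weighted mean: (2 * race-high + 1 * race-middle) / (2 + 1)
        {'name': 'race-weighted',
         'subsets': ['race-high', 'race-middle'],
         'weights': [2, 1]},
    ]
)
```
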
Please note that summary groups for datasets such as MMLU and C-Eval are already stored under the `configs/summarizers/groups` path; it is recommended to use them first.

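A sketch of how one of those predefined group files could be pulled into a config. The `read_base` mechanism is the usual OpenCompass config idiom, but the exact import path and the `mmlu_summary_groups` variable name are assumptions to be verified against the files under `configs/summarizers/groups`:

```python
from mmengine.config import read_base

with read_base():
    # Assumed path and variable name; check configs/summarizers/groups/mmlu.py
    from .summarizers.groups.mmlu import mmlu_summary_groups

summarizer = dict(
    dataset_abbrs=['mmlu'],
    summary_groups=mmlu_summary_groups,
)
```
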
@@ -245,6 +245,10 @@ outputs/default/
├── ...
```

The process of printing the evaluation results can be further customized to output the average scores of some datasets (e.g. MMLU, C-Eval).

More about the output of evaluation results can be found in [Results Summary](./user_guides/summarizer.md).

## More Tutorials

To learn more about OpenCompass, follow the links below.

@@ -253,4 +257,5 @@ outputs/default/

- [Prepare Models](./user_guides/models.md)
- [Task Execution and Monitoring](./user_guides/experimentation.md)
- [How to Tune Prompts](./prompt/overview.md)
- [Results Summary](./user_guides/summarizer.md)
- [Learn about Config](./user_guides/config.md)

@@ -37,6 +37,7 @@ OpenCompass 上手路线

   user_guides/evaluation.md
   user_guides/experimentation.md
   user_guides/metrics.md
   user_guides/summarizer.md

.. _提示词:

.. toctree::

docs/zh_cn/user_guides/summarizer.md (new file, 60 lines)

@@ -0,0 +1,60 @@

# Results Summary

After the evaluation is complete, the results need to be printed to the screen or saved. This process is controlled by the summarizer.

```{note}
If a summarizer appears in the config, the evaluation results will be output according to the logic below.
If no summarizer appears in the config, the evaluation results will be output in the order in which they appear in the `datasets` config.
```

## Example

A typical summarizer configuration file is as follows:

```python
summarizer = dict(
    dataset_abbrs=[
        'race',
        'race-high',
        'race-middle',
    ],
    summary_groups=[
        {'name': 'race', 'subsets': ['race-high', 'race-middle']},
    ]
)
```

The output is as follows:

```text
dataset      version    metric         mode    internlm-7b-hf
-----------  ---------  -------------  ------  ----------------
race         -          naive_average  ppl     76.23
race-high    0c332f     accuracy       ppl     74.53
race-middle  0c332f     accuracy       ppl     77.92
```

Taking the `models` and `datasets` in the config as the full set, the summarizer tries to read the evaluation scores under the `{work_dir}/results/` path and displays them in the order of the `summarizer.dataset_abbrs` list. In addition, the summarizer tries to compute the aggregated metrics defined in `summarizer.summary_groups`. A `name` metric is generated if and only if all the values in `subsets` exist, which means that if some scores are missing, the aggregated metric will be missing as well. If a score cannot be obtained in either of these two ways, the summarizer puts `-` in the corresponding cell of the table.

In addition, the output has multiple columns:

- The `dataset` column corresponds one-to-one with the `summarizer.dataset_abbrs` configuration.
- The `version` column is the hash of the dataset, which takes into account the dataset's evaluation method, prompts, output length limit, and other information. Users can use this column to confirm whether two evaluation results are comparable.
- The `metric` column indicates the evaluation method of the metric; see [metrics](./metrics.md) for details.
- The `mode` column indicates how the inference result was obtained; possible values are `ppl` / `gen`. For items in `summarizer.summary_groups`, if the `subsets` were all obtained in the same way, the value is the same as that of the `subsets`; otherwise it is `mixed`.
- Each of the subsequent columns represents one model.

## Full Field Description

The fields of the summarizer are explained as follows:

- `dataset_abbrs`: (list, optional) Items to display. If omitted, all evaluation results will be output.
- `summary_groups`: (list, optional) Configuration of the aggregated metrics.

The fields in `summary_groups` are as follows:

- `name`: (str) Name of the aggregated metric.
- `subsets`: (list) Names of the metrics to aggregate. Note that an entry can be not only an original `dataset_abbr` but also the name of another aggregated metric.
- `weights`: (list, optional) Weights of the metrics being aggregated. If omitted, an unweighted average is used by default.

Note that summary groups for datasets such as MMLU and C-Eval are already stored under the `configs/summarizers/groups` path; it is recommended to use them first.