[Doc] Add TBD Token in Datasets Statistics (#1986)

* feat

* doc

* doc

* doc

* doc
Myhs_phz 2025-03-31 19:08:55 +08:00 committed by GitHub
parent 0f46c35211
commit f71eb78c72
5 changed files with 108 additions and 58 deletions

README.md

@@ -176,69 +176,83 @@ Some third-party features, like Humaneval and Llama, may require additional steps.

After ensuring that OpenCompass is installed correctly according to the above steps and the datasets are prepared, you can now start your first evaluation using OpenCompass!

-- Your first evaluation with OpenCompass!
+### Your first evaluation with OpenCompass!

OpenCompass supports setting your configs via the CLI or a Python script. For simple evaluation settings, we recommend the CLI; for more complex evaluations, the script approach is suggested. You can find more example scripts under the configs folder.

```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen

# Python scripts
opencompass examples/eval_chat_demo.py
```

You can find more script examples under the [examples](./examples) folder.

-- API evaluation
+### API evaluation

OpenCompass, by design, does not discriminate between open-source models and API models. You can evaluate both model types in the same way, or even in one setting.

```bash
export OPENAI_API_KEY="YOUR_OPEN_API_KEY"
# CLI
opencompass --models gpt_4o_2024_05_13 --datasets demo_gsm8k_chat_gen

# Python scripts
opencompass examples/eval_api_demo.py

# You can use o1_mini_2024_09_12/o1_preview_2024_09_12 for o1 models; we set max_completion_tokens=8192 by default.
```

-- Accelerated Evaluation
+### Accelerated Evaluation

Additionally, if you want to use an inference backend other than HuggingFace for accelerated evaluation, such as LMDeploy or vLLM, you can do so with the command below. Please ensure that you have installed the necessary packages for the chosen backend and that your model supports accelerated inference with it. For more information, see the documentation on inference acceleration backends [here](docs/en/advanced_guides/accelerator_intro.md). Below is an example using LMDeploy:

```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy

# Python scripts
opencompass examples/eval_lmdeploy_demo.py
```

-- Supported Models
+### Supported Models and Datasets

OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/en/tools.md#list-configs).

```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```

-If the model is not on the list but supported by Huggingface AutoModel class, you can also evaluate it with OpenCompass. You are welcome to contribute to the maintenance of the OpenCompass supported model and dataset lists.
+#### Supported Models
+
+If the model is not on the list but supported by the Huggingface AutoModel class, or by an encapsulation of an inference engine based on the OpenAI interface (see the [docs](https://opencompass.readthedocs.io/en/latest/advanced_guides/new_model.html) for details), you can also evaluate it with OpenCompass. You are welcome to contribute to the maintenance of the OpenCompass supported model and dataset lists.

```bash
opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat
```

+#### Supported Datasets
+
+Currently, OpenCompass provides standard recommended configurations for datasets. Generally, config files ending with `_gen.py` or `_llm_judge_gen.py` point to the recommended config we provide for that dataset. You can refer to the [docs](https://opencompass.readthedocs.io/en/latest/dataset_statistics.html) for more details.
+
+```bash
+# Recommended Evaluation Config based on Rules
+opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat
+# Recommended Evaluation Config based on LLM Judge
+opencompass --datasets aime2024_llm_judge_gen --models hf_internlm2_5_1_8b_chat
+```

If you want to use multiple GPUs to evaluate the model in data parallel, you can use `--max-num-worker`.

```bash
CUDA_VISIBLE_DEVICES=0,1 opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat --max-num-worker 2
```

> \[!TIP\]
>

@@ -288,7 +302,7 @@ You can quickly find the dataset you need from the list through sorting, filtering, and searching.

In addition, we provide a recommended configuration for each dataset, and some datasets also support LLM Judge-based configurations.
-Please refer to the dataset statistics chapter of the [official document](https://opencompass.org.cn/doc) for details.
+Please refer to the dataset statistics chapter of the [docs](https://opencompass.readthedocs.io/en/latest/dataset_statistics.html) for details.

<p align="right"><a href="#top">🔝Back to top</a></p>


@@ -208,9 +208,9 @@ humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ce
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy
```

-### Supported Models
+### Supported Models and Datasets

OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/zh_cn/tools.md#ListConfigs).

```bash
# List all configurations
@@ -219,13 +219,27 @@ humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ce
python tools/list_configs.py llama mmlu
```

-If the model is not on the list but supports the Huggingface AutoModel class, you can still evaluate it with OpenCompass. You are welcome to contribute to the maintenance of the OpenCompass supported model and dataset lists.
+#### Supported Models
+
+If the model is not on the list but supports the Huggingface AutoModel class, or an encapsulation of an inference engine based on the OpenAI interface (see the [official docs](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/new_model.html) for details), you can still evaluate it with OpenCompass. You are welcome to contribute to the maintenance of the OpenCompass supported model and dataset lists.

```bash
opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat
```

-If you want to run model inference on multiple GPUs, you can use the `--max-num-worker` argument.
+#### Supported Datasets
+
+Currently, OpenCompass provides standard recommended configurations for datasets. Generally, config files ending with `_gen.py` or `_llm_judge_gen.py` point to the recommended config we provide for that dataset. See the dataset statistics chapter of the [official docs](https://opencompass.readthedocs.io/zh-cn/latest/dataset_statistics.html) for details.
+
+```bash
+# Recommended config based on rules
+opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat
+# Recommended config based on LLM Judge
+opencompass --datasets aime2024_llm_judge_gen --models hf_internlm2_5_1_8b_chat
+```
+
+In addition, if you want to run model inference on multiple GPUs, you can use the `--max-num-worker` argument.

```bash
CUDA_VISIBLE_DEVICES=0,1 opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat --max-num-worker 2
@@ -281,9 +295,7 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows:

You can quickly find the dataset you need from the list through sorting, filtering, and searching.
In addition, we provide a recommended configuration for each dataset, and some datasets also support LLM Judge-based configurations.
-Please refer to the dataset statistics chapter of the [official documentation](https://opencompass.org.cn/doc) for details.
+Please refer to the dataset statistics chapter of the [official documentation](https://opencompass.readthedocs.io/zh-cn/latest/dataset_statistics.html) for details.

<p align="right"><a href="#top">🔝Back to top</a></p>


@@ -121,7 +121,7 @@
     category: Reasoning
     paper: https://arxiv.org/pdf/2310.16049
     configpath: opencompass/configs/datasets/musr/musr_gen.py
-    configpath_llmjudge: opencompass/configs/datasets/mmlu/mmlu_llm_judge_gen.py
+    configpath_llmjudge: opencompass/configs/datasets/musr/musr_llm_judge_gen.py
 - needlebench:
     name: NeedleBench
     category: Long Context


@@ -32,12 +32,23 @@ with open(load_path, 'r') as f2:

 HEADER = ['name', 'category', 'paper', 'configpath', 'configpath_llmjudge']

+recommanded_dataset_list = [
+    'ifeval', 'aime2024', 'bbh', 'bigcodebench', 'cmmlu', 'drop', 'gpqa',
+    'hellaswag', 'humaneval', 'korbench', 'livecodebench', 'math', 'mmlu',
+    'mmlu_pro', 'musr'
+]
+

 def table_format(data_list):
     table_format_list = []
     for i in data_list:
         table_format_list_sub = []
         for j in i:
+            if j in recommanded_dataset_list:
+                link_token = '[link]('
+            else:
+                link_token = '[link(TBD)]('
             for index in HEADER:
                 if index == 'paper':
                     table_format_list_sub.append('[link](' + i[j][index] + ')')

@@ -45,18 +56,18 @@ def table_format(data_list):
                     if i[j][index] == '':
                         table_format_list_sub.append(i[j][index])
                     else:
-                        table_format_list_sub.append('[link](' +
+                        table_format_list_sub.append(link_token +
                                                      GITHUB_PREFIX +
                                                      i[j][index] + ')')
                 elif index == 'configpath':
                     if isinstance(i[j][index], list):
                         sub_list_text = ''
                         for k in i[j][index]:
-                            sub_list_text += ('[link](' + GITHUB_PREFIX + k +
+                            sub_list_text += (link_token + GITHUB_PREFIX + k +
                                               ') / ')
                         table_format_list_sub.append(sub_list_text[:-2])
                     else:
-                        table_format_list_sub.append('[link](' +
+                        table_format_list_sub.append(link_token +
                                                      GITHUB_PREFIX +
                                                      i[j][index] + ')')
                 else:
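The change above is what introduces the TBD marker in the generated dataset statistics tables: datasets in `recommanded_dataset_list` keep a plain link to their recommended config, while all other datasets have their config links rendered with a `(TBD)` token. Below is a minimal, self-contained sketch of that rendering; `render_config_link` is a hypothetical helper for illustration only, the `GITHUB_PREFIX` value is assumed (its definition lies outside this hunk), and the example paths are made up.

```python
# Sketch of the link rendering introduced in the diff above (illustrative only).
# GITHUB_PREFIX is an assumed value; the real script defines it elsewhere.
GITHUB_PREFIX = 'https://github.com/open-compass/opencompass/blob/main/'

# Abbreviated version of recommanded_dataset_list for this example.
recommanded_dataset_list = ['mmlu', 'musr']


def render_config_link(dataset_key: str, configpath: str) -> str:
    """Render a Markdown link for a config path; datasets without a
    curated recommended config are flagged with a TBD token."""
    if dataset_key in recommanded_dataset_list:
        link_token = '[link]('
    else:
        link_token = '[link(TBD)]('
    return link_token + GITHUB_PREFIX + configpath + ')'


print(render_config_link('musr', 'opencompass/configs/datasets/musr/musr_gen.py'))
# [link](https://github.com/open-compass/opencompass/blob/main/opencompass/configs/datasets/musr/musr_gen.py)
print(render_config_link('needlebench', 'configs/datasets/needlebench_gen.py'))
# [link(TBD)](https://github.com/open-compass/opencompass/blob/main/configs/datasets/needlebench_gen.py)
```

The same token swap replaces the hard-coded '[链接](' prefix in the Chinese variant of the script below, so both language tables stay in sync.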


@@ -30,12 +30,23 @@ with open(load_path, 'r') as f2:

 HEADER = ['name', 'category', 'paper', 'configpath', 'configpath_llmjudge']

+recommanded_dataset_list = [
+    'ifeval', 'aime2024', 'bbh', 'bigcodebench', 'cmmlu', 'drop', 'gpqa',
+    'hellaswag', 'humaneval', 'korbench', 'livecodebench', 'math', 'mmlu',
+    'mmlu_pro', 'musr'
+]
+

 def table_format(data_list):
     table_format_list = []
     for i in data_list:
         table_format_list_sub = []
         for j in i:
+            if j in recommanded_dataset_list:
+                link_token = '[链接]('
+            else:
+                link_token = '[链接(TBD)]('
             for index in HEADER:
                 if index == 'paper':
                     table_format_list_sub.append('[链接](' + i[j][index] + ')')

@@ -43,17 +54,19 @@ def table_format(data_list):
                     if i[j][index] == '':
                         table_format_list_sub.append(i[j][index])
                     else:
-                        table_format_list_sub.append('[链接](' + GITHUB_PREFIX +
+                        table_format_list_sub.append(link_token +
+                                                     GITHUB_PREFIX +
                                                      i[j][index] + ')')
                 elif index == 'configpath':
                     if isinstance(i[j][index], list):
                         sub_list_text = ''
                         for k in i[j][index]:
-                            sub_list_text += ('[链接](' + GITHUB_PREFIX + k +
+                            sub_list_text += (link_token + GITHUB_PREFIX + k +
                                               ') / ')
                         table_format_list_sub.append(sub_list_text[:-2])
                     else:
-                        table_format_list_sub.append('[链接](' + GITHUB_PREFIX +
+                        table_format_list_sub.append(link_token +
+                                                     GITHUB_PREFIX +
                                                      i[j][index] + ')')
                 else:
                     table_format_list_sub.append(i[j][index])