[Update] Update performance of common benchmarks (#1109)

Songyang Zhang 2024-04-30 00:09:08 +08:00 committed by GitHub
parent a6f67e1a65
commit 063f5f5f49
6 changed files with 54 additions and 1 deletion


@@ -70,6 +70,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2024.04.29\]** We report the performance of several well-known LLMs on common benchmarks; see the [documentation](https://opencompass.readthedocs.io/en/latest/user_guides/corebench.html) for more information! 🔥🔥🔥
- **\[2024.04.26\]** We deprecated the multi-modality evaluation function in OpenCompass; the related implementation has moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), welcome to use it! 🔥🔥🔥
- **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py), welcome to try! 🔥🔥🔥
- **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) and [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py), welcome to try! 🔥🔥🔥
@@ -170,8 +171,11 @@ python run.py --datasets ceval_ppl mmlu_ppl \
--num-gpus 1 # Number of minimum required GPUs
```
> \[!TIP\]
>
> To run the command above, you will need to remove the comments starting with `# ` first.
> Configurations with `_ppl` are typically designed for base models.
> Configurations with `_gen` can be used for both base and chat models (see the sketch below).
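As a rough sketch of that distinction (the model paths and any flags beyond `--datasets` and `--num-gpus` are illustrative assumptions; check `python run.py --help` and the `configs/` directory for the exact options and dataset abbreviations):

```bash
# Base model: perplexity-style (_ppl) dataset configs are the usual choice.
python run.py --datasets ceval_ppl mmlu_ppl \
    --hf-path huggyllama/llama-7b \
    --num-gpus 1

# Chat model: generation-style (_gen) dataset configs work for both base and chat models.
python run.py --datasets ceval_gen mmlu_gen \
    --hf-path meta-llama/Meta-Llama-3-8B-Instruct \
    --num-gpus 1
```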
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html) to learn how to run an evaluation task.


@@ -69,6 +69,7 @@
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2024.04.26\]** We report the performance of typical LLMs on common benchmarks; see the [documentation](https://opencompass.readthedocs.io/zh-cn/latest/user_guides/corebench.html) for more information! 🔥🔥🔥
- **\[2024.04.26\]** We deprecated the multi-modality evaluation function in OpenCompass; the related functionality has moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), which we recommend using! 🔥🔥🔥
- **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py), welcome to try! 🔥🔥🔥
- **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) and [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py), welcome to try! 🔥🔥🔥


@@ -40,6 +40,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
user_guides/experimentation.md
user_guides/metrics.md
user_guides/summarizer.md
user_guides/corebench.md
.. _Prompt:
.. toctree::


@@ -0,0 +1,23 @@
# Performance of Common Benchmarks
We have selected several well-known benchmarks for evaluating large language models (LLMs) and report detailed results for a set of widely used LLMs on these datasets.
| Dataset              | Version | Metric                       | Mode | GPT-4-1106 | GPT-4-0409 | Claude-3-Opus | Llama-3-70b-Instruct(lmdeploy) | Mixtral-8x22B-Instruct-v0.1 |
| -------------------- | ------- | ---------------------------- | ---- | ---------- | ---------- | ------------- | ------------------------------ | --------------------------- |
| MMLU | - | naive_average | gen | 83.6 | 84.2 | 84.6 | 80.5 | 77.2 |
| CMMLU | - | naive_average | gen | 71.9 | 72.4 | 74.2 | 70.1 | 59.7 |
| CEval-Test | - | naive_average | gen | 69.7 | 70.5 | 71.7 | 66.9 | 58.7 |
| GaokaoBench | - | weighted_average | gen | 74.8 | 76.0 | 74.2 | 67.8 | 60.0 |
| Triviaqa_wiki(1shot) | 01cf41 | score | gen | 73.1 | 82.9 | 82.4 | 89.8 | 89.7 |
| NQ_open(1shot) | eaf81e | score | gen | 27.9 | 30.4 | 39.4 | 40.1 | 46.8 |
| Race-High | 9a54b6 | accuracy | gen | 89.3 | 89.6 | 90.8 | 89.4 | 84.8 |
| WinoGrande | 6447e6 | accuracy | gen | 80.7 | 83.3 | 84.1 | 69.7 | 76.6 |
| HellaSwag | e42710 | accuracy | gen | 92.7 | 93.5 | 94.6 | 87.7 | 86.1 |
| BBH | - | naive_average | gen | 82.7 | 78.5 | 78.5 | 80.5 | 79.1 |
| GSM-8K | 1d7fe4 | accuracy | gen | 80.5 | 79.7 | 87.7 | 90.2 | 88.3 |
| Math | 393424 | accuracy | gen | 61.9 | 71.2 | 60.2 | 47.1 | 50 |
| TheoremQA | ef26ca | accuracy | gen | 28.4 | 23.3 | 29.6 | 25.4 | 13 |
| HumanEval | 8e312c | humaneval_pass@1 | gen | 74.4 | 82.3 | 76.2 | 72.6 | 72.0 |
| MBPP(sanitized) | 1e1056 | score | gen | 78.6 | 77.0 | 76.7 | 71.6 | 68.9 |
| GPQA_diamond | 4baadb | accuracy | gen | 40.4 | 48.5 | 46.5 | 38.9 | 36.4 |
| IFEval | 3321a3 | Prompt-level-strict-accuracy | gen | 71.9 | 79.9 | 80.0 | 77.1 | 65.8 |
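In principle, rows of this table can be reproduced with OpenCompass itself. Below is a minimal, hypothetical sketch; the dataset abbreviations (`mmlu_gen`, `gsm8k_gen`, `humaneval_gen`), model path, and GPU count are assumptions that should be matched against the actual files under `configs/` and your hardware:

```bash
# Hypothetical reproduction sketch: evaluate one of the open-weight models above
# on a subset of the listed benchmarks. Adjust --num-gpus to the model size.
python run.py \
    --datasets mmlu_gen gsm8k_gen humaneval_gen \
    --hf-path meta-llama/Meta-Llama-3-70B-Instruct \
    --num-gpus 4
```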


@@ -41,6 +41,7 @@ OpenCompass Getting Started Roadmap
user_guides/experimentation.md
user_guides/metrics.md
user_guides/summarizer.md
user_guides/corebench.md
.. _提示词:
.. toctree::


@@ -0,0 +1,23 @@
# Performance on Major Datasets
We have selected some well-known benchmarks for evaluating large language models (LLMs) and provide detailed performance results of major LLMs on these datasets.
| Dataset              | Version | Metric                       | Mode | GPT-4-1106 | GPT-4-0409 | Claude-3-Opus | Llama-3-70b-Instruct(lmdeploy) | Mixtral-8x22B-Instruct-v0.1 |
| -------------------- | ------- | ---------------------------- | ---- | ---------- | ---------- | ------------- | ------------------------------ | --------------------------- |
| MMLU | - | naive_average | gen | 83.6 | 84.2 | 84.6 | 80.5 | 77.2 |
| CMMLU | - | naive_average | gen | 71.9 | 72.4 | 74.2 | 70.1 | 59.7 |
| CEval-Test | - | naive_average | gen | 69.7 | 70.5 | 71.7 | 66.9 | 58.7 |
| GaokaoBench | - | weighted_average | gen | 74.8 | 76.0 | 74.2 | 67.8 | 60.0 |
| Triviaqa_wiki(1shot) | 01cf41 | score | gen | 73.1 | 82.9 | 82.4 | 89.8 | 89.7 |
| NQ_open(1shot) | eaf81e | score | gen | 27.9 | 30.4 | 39.4 | 40.1 | 46.8 |
| Race-High | 9a54b6 | accuracy | gen | 89.3 | 89.6 | 90.8 | 89.4 | 84.8 |
| WinoGrande | 6447e6 | accuracy | gen | 80.7 | 83.3 | 84.1 | 69.7 | 76.6 |
| HellaSwag | e42710 | accuracy | gen | 92.7 | 93.5 | 94.6 | 87.7 | 86.1 |
| BBH | - | naive_average | gen | 82.7 | 78.5 | 78.5 | 80.5 | 79.1 |
| GSM-8K | 1d7fe4 | accuracy | gen | 80.5 | 79.7 | 87.7 | 90.2 | 88.3 |
| Math | 393424 | accuracy | gen | 61.9 | 71.2 | 60.2 | 47.1 | 50 |
| TheoremQA | ef26ca | accuracy | gen | 28.4 | 23.3 | 29.6 | 25.4 | 13 |
| HumanEval | 8e312c | humaneval_pass@1 | gen | 74.4 | 82.3 | 76.2 | 72.6 | 72.0 |
| MBPP(sanitized) | 1e1056 | score | gen | 78.6 | 77.0 | 76.7 | 71.6 | 68.9 |
| GPQA_diamond | 4baadb | accuracy | gen | 40.4 | 48.5 | 46.5 | 38.9 | 36.4 |
| IFEval | 3321a3 | Prompt-level-strict-accuracy | gen | 71.9 | 79.9 | 80.0 | 77.1 | 65.8 |