Mirror of https://github.com/open-compass/opencompass.git
[Update] Update performance of common benchmarks (#1109)
* [Update] Update performance of common benchmarks
parent a6f67e1a65
commit 063f5f5f49
@@ -70,6 +70,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through

## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2024.04.29\]** We report the performance of several well-known LLMs on common benchmarks; see the [documentation](https://opencompass.readthedocs.io/en/latest/user_guides/corebench.html) for more information! 🔥🔥🔥
- **\[2024.04.26\]** We deprecated the multi-modality evaluation function in OpenCompass; the related implementation has moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), welcome to use it! 🔥🔥🔥
- **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py), welcome to try! 🔥🔥🔥
- **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) and [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py), welcome to try! 🔥🔥🔥
@@ -170,8 +171,11 @@ python run.py --datasets ceval_ppl mmlu_ppl \
--num-gpus 1 # Minimum number of required GPUs
```

> \[!TIP\]
>
> To run the command above, you will need to remove the comments starting with `# ` first.
> Configurations with `_ppl` are typically designed for base models.
> Configurations with `_gen` can be used for both base and chat models.
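Following the tip above, a comment-free variant of the command, switched to the `_gen` configurations so it also suits a chat model, might look like the sketch below (the `--hf-path` value is an illustrative choice, not part of the original snippet):

```bash
# A minimal sketch assuming the _gen dataset configs and a chat model;
# verify flag names with `python run.py --help` for your OpenCompass version.
python run.py --datasets ceval_gen mmlu_gen \
    --hf-path meta-llama/Meta-Llama-3-8B-Instruct \
    --num-gpus 1
```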
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html) to learn how to run an evaluation task.
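For the configuration-file route, a minimal hedged sketch might be the following (the `configs/eval_demo.py` path and the `-w` work-directory flag are assumptions to verify against your checkout):

```bash
# A sketch of a config-driven evaluation; eval_demo.py is assumed to bundle
# example model and dataset configs, and -w is assumed to set the output dir.
python run.py configs/eval_demo.py -w outputs/demo
```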
@@ -69,6 +69,7 @@

## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2024.04.26\]** We report the performance of typical LLMs on common benchmarks; see the [documentation](https://opencompass.readthedocs.io/zh-cn/latest/user_guides/corebench.html) for more information! 🔥🔥🔥
- **\[2024.04.26\]** We deprecated OpenCompass's multi-modality evaluation function; the related functionality has moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), which we recommend! 🔥🔥🔥
- **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py), welcome to try! 🔥🔥🔥
- **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) and [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py), welcome to try! 🔥🔥🔥
@@ -40,6 +40,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.

   user_guides/experimentation.md
   user_guides/metrics.md
   user_guides/summarizer.md
   user_guides/corebench.md

.. _Prompt:

.. toctree::
docs/en/user_guides/corebench.md (new file, 23 lines)

@@ -0,0 +1,23 @@
# Performance of Common Benchmarks

We have selected several well-known benchmarks for evaluating large language models (LLMs) and provide detailed performance results for prominent LLMs on these datasets.

| Dataset              | Version | Metric                       | Mode | GPT-4-1106 | GPT-4-0409 | Claude-3-Opus | Llama-3-70b-Instruct (lmdeploy) | Mixtral-8x22B-Instruct-v0.1 |
| -------------------- | ------- | ---------------------------- | ---- | ---------- | ---------- | ------------- | ------------------------------- | --------------------------- |
| MMLU                 | -       | naive_average                | gen  | 83.6       | 84.2       | 84.6          | 80.5                            | 77.2                        |
| CMMLU                | -       | naive_average                | gen  | 71.9       | 72.4       | 74.2          | 70.1                            | 59.7                        |
| CEval-Test           | -       | naive_average                | gen  | 69.7       | 70.5       | 71.7          | 66.9                            | 58.7                        |
| GaokaoBench          | -       | weighted_average             | gen  | 74.8       | 76.0       | 74.2          | 67.8                            | 60.0                        |
| Triviaqa_wiki(1shot) | 01cf41  | score                        | gen  | 73.1       | 82.9       | 82.4          | 89.8                            | 89.7                        |
| NQ_open(1shot)       | eaf81e  | score                        | gen  | 27.9       | 30.4       | 39.4          | 40.1                            | 46.8                        |
| Race-High            | 9a54b6  | accuracy                     | gen  | 89.3       | 89.6       | 90.8          | 89.4                            | 84.8                        |
| WinoGrande           | 6447e6  | accuracy                     | gen  | 80.7       | 83.3       | 84.1          | 69.7                            | 76.6                        |
| HellaSwag            | e42710  | accuracy                     | gen  | 92.7       | 93.5       | 94.6          | 87.7                            | 86.1                        |
| BBH                  | -       | naive_average                | gen  | 82.7       | 78.5       | 78.5          | 80.5                            | 79.1                        |
| GSM-8K               | 1d7fe4  | accuracy                     | gen  | 80.5       | 79.7       | 87.7          | 90.2                            | 88.3                        |
| Math                 | 393424  | accuracy                     | gen  | 61.9       | 71.2       | 60.2          | 47.1                            | 50.0                        |
| TheoremQA            | ef26ca  | accuracy                     | gen  | 28.4       | 23.3       | 29.6          | 25.4                            | 13.0                        |
| HumanEval            | 8e312c  | humaneval_pass@1             | gen  | 74.4       | 82.3       | 76.2          | 72.6                            | 72.0                        |
| MBPP(sanitized)      | 1e1056  | score                        | gen  | 78.6       | 77.0       | 76.7          | 71.6                            | 68.9                        |
| GPQA_diamond         | 4baadb  | accuracy                     | gen  | 40.4       | 48.5       | 46.5          | 38.9                            | 36.4                        |
| IFEval               | 3321a3  | Prompt-level-strict-accuracy | gen  | 71.9       | 79.9       | 80.0          | 77.1                            | 65.8                        |
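As a rough, hedged sketch, part of this table could in principle be reproduced with predefined configs along the following lines (the model and dataset config names are assumptions based on the `configs/` layout, not confirmed by this page):

```bash
# Illustrative only: hf_llama3_70b_instruct, mmlu_gen, and gsm8k_gen are
# assumed config names; check configs/models and configs/datasets first.
python run.py --models hf_llama3_70b_instruct --datasets mmlu_gen gsm8k_gen
```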
@@ -41,6 +41,7 @@ OpenCompass getting-started roadmap

   user_guides/experimentation.md
   user_guides/metrics.md
   user_guides/summarizer.md
   user_guides/corebench.md

.. _提示词:

.. toctree::
docs/zh_cn/user_guides/corebench.md (new file, 23 lines)

@@ -0,0 +1,23 @@
# Performance of Common Benchmarks

We selected a number of well-known benchmarks for evaluating large language models (LLMs) and provide detailed results for major LLMs on these datasets.

| Dataset              | Version | Metric                       | Mode | GPT-4-1106 | GPT-4-0409 | Claude-3-Opus | Llama-3-70b-Instruct (lmdeploy) | Mixtral-8x22B-Instruct-v0.1 |
| -------------------- | ------- | ---------------------------- | ---- | ---------- | ---------- | ------------- | ------------------------------- | --------------------------- |
| MMLU                 | -       | naive_average                | gen  | 83.6       | 84.2       | 84.6          | 80.5                            | 77.2                        |
| CMMLU                | -       | naive_average                | gen  | 71.9       | 72.4       | 74.2          | 70.1                            | 59.7                        |
| CEval-Test           | -       | naive_average                | gen  | 69.7       | 70.5       | 71.7          | 66.9                            | 58.7                        |
| GaokaoBench          | -       | weighted_average             | gen  | 74.8       | 76.0       | 74.2          | 67.8                            | 60.0                        |
| Triviaqa_wiki(1shot) | 01cf41  | score                        | gen  | 73.1       | 82.9       | 82.4          | 89.8                            | 89.7                        |
| NQ_open(1shot)       | eaf81e  | score                        | gen  | 27.9       | 30.4       | 39.4          | 40.1                            | 46.8                        |
| Race-High            | 9a54b6  | accuracy                     | gen  | 89.3       | 89.6       | 90.8          | 89.4                            | 84.8                        |
| WinoGrande           | 6447e6  | accuracy                     | gen  | 80.7       | 83.3       | 84.1          | 69.7                            | 76.6                        |
| HellaSwag            | e42710  | accuracy                     | gen  | 92.7       | 93.5       | 94.6          | 87.7                            | 86.1                        |
| BBH                  | -       | naive_average                | gen  | 82.7       | 78.5       | 78.5          | 80.5                            | 79.1                        |
| GSM-8K               | 1d7fe4  | accuracy                     | gen  | 80.5       | 79.7       | 87.7          | 90.2                            | 88.3                        |
| Math                 | 393424  | accuracy                     | gen  | 61.9       | 71.2       | 60.2          | 47.1                            | 50.0                        |
| TheoremQA            | ef26ca  | accuracy                     | gen  | 28.4       | 23.3       | 29.6          | 25.4                            | 13.0                        |
| HumanEval            | 8e312c  | humaneval_pass@1             | gen  | 74.4       | 82.3       | 76.2          | 72.6                            | 72.0                        |
| MBPP(sanitized)      | 1e1056  | score                        | gen  | 78.6       | 77.0       | 76.7          | 71.6                            | 68.9                        |
| GPQA_diamond         | 4baadb  | accuracy                     | gen  | 40.4       | 48.5       | 46.5          | 38.9                            | 36.4                        |
| IFEval               | 3321a3  | Prompt-level-strict-accuracy | gen  | 71.9       | 79.9       | 80.0          | 77.1                            | 65.8                        |