diff --git a/README.md b/README.md
index a996af95..b75606b8 100644
--- a/README.md
+++ b/README.md
@@ -70,6 +70,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through

 ## 🚀 What's New

+- **\[2024.04.29\]** We report the performance of several well-known LLMs on common benchmarks; see the [documentation](https://opencompass.readthedocs.io/en/latest/user_guides/corebench.html) for more information! 🔥🔥🔥
 - **\[2024.04.26\]** We deprecated the multi-modality evaluation function of OpenCompass; the related implementation has been moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), welcome to use it! 🔥🔥🔥
 - **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py), welcome to try! 🔥🔥🔥
 - **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) and [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py), welcome to try! 🔥🔥🔥

@@ -170,8 +171,11 @@
 python run.py --datasets ceval_ppl mmlu_ppl \
     --num-gpus 1 # Number of minimum required GPUs
 ```

-> **Note**
+> [!TIP]
+>
+> To run the command above, you will need to remove the comments starting with `# ` first.
+> Configurations ending in `_ppl` are typically intended for base models.
+> Configurations ending in `_gen` can be used for both base models and chat models.

 Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html) to learn how to run an evaluation task.

diff --git a/README_zh-CN.md b/README_zh-CN.md
index dcec720f..d5907202 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -69,6 +69,7 @@

 ## 🚀 What's New

+- **\[2024.04.29\]** We report the performance of well-known LLMs on common benchmarks; see the [documentation](https://opencompass.readthedocs.io/zh-cn/latest/user_guides/corebench.html) for more information! 🔥🔥🔥
 - **\[2024.04.26\]** We deprecated the multi-modality evaluation function of OpenCompass; the related functionality has been moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), which we recommend! 🔥🔥🔥
 - **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py), welcome to try! 🔥🔥🔥
 - **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) and [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py), welcome to try! 🔥🔥🔥

diff --git a/docs/en/index.rst b/docs/en/index.rst
index e3d9ba7c..4dd2dbec 100644
--- a/docs/en/index.rst
+++ b/docs/en/index.rst
@@ -40,6 +40,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.

    user_guides/experimentation.md
    user_guides/metrics.md
    user_guides/summarizer.md
+   user_guides/corebench.md

 .. _Prompt:
 .. toctree::

diff --git a/docs/en/user_guides/corebench.md b/docs/en/user_guides/corebench.md
new file mode 100644
index 00000000..faaf151a
--- /dev/null
+++ b/docs/en/user_guides/corebench.md
@@ -0,0 +1,23 @@
+# Performance of Common Benchmarks
+
+We have identified several well-known benchmarks for evaluating large language models (LLMs) and provide detailed results for a number of prominent LLMs on these datasets.
+
+| Dataset              | Version | Metric                       | Mode | GPT-4-1106 | GPT-4-0409 | Claude-3-Opus | Llama-3-70b-Instruct (lmdeploy) | Mixtral-8x22B-Instruct-v0.1 |
+| -------------------- | ------- | ---------------------------- | ---- | ---------- | ---------- | ------------- | ------------------------------- | --------------------------- |
+| MMLU                 | -       | naive_average                | gen  | 83.6       | 84.2       | 84.6          | 80.5                            | 77.2                        |
+| CMMLU                | -       | naive_average                | gen  | 71.9       | 72.4       | 74.2          | 70.1                            | 59.7                        |
+| CEval-Test           | -       | naive_average                | gen  | 69.7       | 70.5       | 71.7          | 66.9                            | 58.7                        |
+| GaokaoBench          | -       | weighted_average             | gen  | 74.8       | 76.0       | 74.2          | 67.8                            | 60.0                        |
+| Triviaqa_wiki(1shot) | 01cf41  | score                        | gen  | 73.1       | 82.9       | 82.4          | 89.8                            | 89.7                        |
+| NQ_open(1shot)       | eaf81e  | score                        | gen  | 27.9       | 30.4       | 39.4          | 40.1                            | 46.8                        |
+| Race-High            | 9a54b6  | accuracy                     | gen  | 89.3       | 89.6       | 90.8          | 89.4                            | 84.8                        |
+| WinoGrande           | 6447e6  | accuracy                     | gen  | 80.7       | 83.3       | 84.1          | 69.7                            | 76.6                        |
+| HellaSwag            | e42710  | accuracy                     | gen  | 92.7       | 93.5       | 94.6          | 87.7                            | 86.1                        |
+| BBH                  | -       | naive_average                | gen  | 82.7       | 78.5       | 78.5          | 80.5                            | 79.1                        |
+| GSM-8K               | 1d7fe4  | accuracy                     | gen  | 80.5       | 79.7       | 87.7          | 90.2                            | 88.3                        |
+| Math                 | 393424  | accuracy                     | gen  | 61.9       | 71.2       | 60.2          | 47.1                            | 50.0                        |
+| TheoremQA            | ef26ca  | accuracy                     | gen  | 28.4       | 23.3       | 29.6          | 25.4                            | 13.0                        |
+| HumanEval            | 8e312c  | humaneval_pass@1             | gen  | 74.4       | 82.3       | 76.2          | 72.6                            | 72.0                        |
+| MBPP(sanitized)      | 1e1056  | score                        | gen  | 78.6       | 77.0       | 76.7          | 71.6                            | 68.9                        |
+| GPQA_diamond         | 4baadb  | accuracy                     | gen  | 40.4       | 48.5       | 46.5          | 38.9                            | 36.4                        |
+| IFEval               | 3321a3  | Prompt-level-strict-accuracy | gen  | 71.9       | 79.9       | 80.0          | 77.1                            | 65.8                        |

diff --git a/docs/zh_cn/index.rst b/docs/zh_cn/index.rst
index 6f2d9f21..da1dad23 100644
--- a/docs/zh_cn/index.rst
+++ b/docs/zh_cn/index.rst
@@ -41,6 +41,7 @@ OpenCompass 上手路线

    user_guides/experimentation.md
    user_guides/metrics.md
    user_guides/summarizer.md
+   user_guides/corebench.md

 .. _提示词:
 .. toctree::
diff --git a/docs/zh_cn/user_guides/corebench.md b/docs/zh_cn/user_guides/corebench.md
new file mode 100644
index 00000000..41065020
--- /dev/null
+++ b/docs/zh_cn/user_guides/corebench.md
@@ -0,0 +1,23 @@
+# Performance of Common Benchmarks
+
+We have selected several well-known benchmarks for evaluating large language models (LLMs) and provide detailed results for major LLMs on these datasets.
+
+| Dataset              | Version | Metric                       | Mode | GPT-4-1106 | GPT-4-0409 | Claude-3-Opus | Llama-3-70b-Instruct (lmdeploy) | Mixtral-8x22B-Instruct-v0.1 |
+| -------------------- | ------- | ---------------------------- | ---- | ---------- | ---------- | ------------- | ------------------------------- | --------------------------- |
+| MMLU                 | -       | naive_average                | gen  | 83.6       | 84.2       | 84.6          | 80.5                            | 77.2                        |
+| CMMLU                | -       | naive_average                | gen  | 71.9       | 72.4       | 74.2          | 70.1                            | 59.7                        |
+| CEval-Test           | -       | naive_average                | gen  | 69.7       | 70.5       | 71.7          | 66.9                            | 58.7                        |
+| GaokaoBench          | -       | weighted_average             | gen  | 74.8       | 76.0       | 74.2          | 67.8                            | 60.0                        |
+| Triviaqa_wiki(1shot) | 01cf41  | score                        | gen  | 73.1       | 82.9       | 82.4          | 89.8                            | 89.7                        |
+| NQ_open(1shot)       | eaf81e  | score                        | gen  | 27.9       | 30.4       | 39.4          | 40.1                            | 46.8                        |
+| Race-High            | 9a54b6  | accuracy                     | gen  | 89.3       | 89.6       | 90.8          | 89.4                            | 84.8                        |
+| WinoGrande           | 6447e6  | accuracy                     | gen  | 80.7       | 83.3       | 84.1          | 69.7                            | 76.6                        |
+| HellaSwag            | e42710  | accuracy                     | gen  | 92.7       | 93.5       | 94.6          | 87.7                            | 86.1                        |
+| BBH                  | -       | naive_average                | gen  | 82.7       | 78.5       | 78.5          | 80.5                            | 79.1                        |
+| GSM-8K               | 1d7fe4  | accuracy                     | gen  | 80.5       | 79.7       | 87.7          | 90.2                            | 88.3                        |
+| Math                 | 393424  | accuracy                     | gen  | 61.9       | 71.2       | 60.2          | 47.1                            | 50.0                        |
+| TheoremQA            | ef26ca  | accuracy                     | gen  | 28.4       | 23.3       | 29.6          | 25.4                            | 13.0                        |
+| HumanEval            | 8e312c  | humaneval_pass@1             | gen  | 74.4       | 82.3       | 76.2          | 72.6                            | 72.0                        |
+| MBPP(sanitized)      | 1e1056  | score                        | gen  | 78.6       | 77.0       | 76.7          | 71.6                            | 68.9                        |
+| GPQA_diamond         | 4baadb  | accuracy                     | gen  | 40.4       | 48.5       | 46.5          | 38.9                            | 36.4                        |
+| IFEval               | 3321a3  | Prompt-level-strict-accuracy | gen  | 71.9       | 79.9       | 80.0          | 77.1                            | 65.8                        |
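
To make the `_ppl` / `_gen` tip above concrete, here is a minimal sketch of how the two kinds of dataset configurations can be invoked, following the `run.py` pattern shown in the README hunk. The specific dataset config names, the `--hf-path` flag, and the model paths are illustrative assumptions; verify them against `python run.py --help` and the files under `configs/datasets/` in your checkout.

```bash
# Minimal sketch; dataset config and flag names are assumptions --
# verify with `python run.py --help` and configs/datasets/.

# Base model: `_ppl` (perplexity-based) configs are the typical choice.
python run.py --datasets ceval_ppl mmlu_ppl \
    --hf-path huggyllama/llama-7b \
    --num-gpus 1

# Chat model: `_gen` (generation-based) configs suit base and chat models alike.
python run.py --datasets gsm8k_gen mmlu_gen \
    --hf-path meta-llama/Meta-Llama-3-8B-Instruct \
    --num-gpus 1
```

Roughly, `_ppl` configs score each candidate answer by perplexity, which presumes a raw language-modeling interface, while `_gen` configs parse answers out of free-form generations and therefore also work with chat models.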