# Performance of Common Benchmarks

We have selected several well-known benchmarks for evaluating large language models (LLMs), and report the performance of widely used LLMs on these datasets below.

| Model | Version | Metric | Mode | GPT-4-1106 | GPT-4-0409 | Claude-3-Opus | Llama-3-70b-Instruct (lmdeploy) | Mixtral-8x22B-Instruct-v0.1 |
|-------|---------|--------|------|------------|------------|---------------|---------------------------------|-----------------------------|
| MMLU | - | naive_average | gen | 83.6 | 84.2 | 84.6 | 80.5 | 77.2 |
| CMMLU | - | naive_average | gen | 71.9 | 72.4 | 74.2 | 70.1 | 59.7 |
| CEval-Test | - | naive_average | gen | 69.7 | 70.5 | 71.7 | 66.9 | 58.7 |
| GaokaoBench | - | weighted_average | gen | 74.8 | 76.0 | 74.2 | 67.8 | 60.0 |
| Triviaqa_wiki(1shot) | 01cf41 | score | gen | 73.1 | 82.9 | 82.4 | 89.8 | 89.7 |
| NQ_open(1shot) | eaf81e | score | gen | 27.9 | 30.4 | 39.4 | 40.1 | 46.8 |
| Race-High | 9a54b6 | accuracy | gen | 89.3 | 89.6 | 90.8 | 89.4 | 84.8 |
| WinoGrande | 6447e6 | accuracy | gen | 80.7 | 83.3 | 84.1 | 69.7 | 76.6 |
| HellaSwag | e42710 | accuracy | gen | 92.7 | 93.5 | 94.6 | 87.7 | 86.1 |
| BBH | - | naive_average | gen | 82.7 | 78.5 | 78.5 | 80.5 | 79.1 |
| GSM-8K | 1d7fe4 | accuracy | gen | 80.5 | 79.7 | 87.7 | 90.2 | 88.3 |
| Math | 393424 | accuracy | gen | 61.9 | 71.2 | 60.2 | 47.1 | 50 |
| TheoremQA | ef26ca | accuracy | gen | 28.4 | 23.3 | 29.6 | 25.4 | 13 |
| HumanEval | 8e312c | humaneval_pass@1 | gen | 74.4 | 82.3 | 76.2 | 72.6 | 72.0 |
| MBPP(sanitized) | 1e1056 | score | gen | 78.6 | 77.0 | 76.7 | 71.6 | 68.9 |
| GPQA_diamond | 4baadb | accuracy | gen | 40.4 | 48.5 | 46.5 | 38.9 | 36.4 |
| IFEval | 3321a3 | Prompt-level-strict-accuracy | gen | 71.9 | 79.9 | 80.0 | 77.1 | 65.8 |
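
Results like these can be reproduced by running OpenCompass with the corresponding dataset and model configurations. Below is a minimal config sketch, assuming the import paths for the MMLU, GSM-8K, and Llama-3-70b-Instruct configs match those shipped with your OpenCompass release; the file name `eval_corebench.py` is purely illustrative.

```python
# configs/eval_corebench.py -- illustrative name, not a file in the repository
from mmengine.config import read_base

with read_base():
    # Dataset and model configs bundled with OpenCompass; the exact module
    # paths are assumptions and may differ between releases.
    from .datasets.mmlu.mmlu_gen import mmlu_datasets
    from .datasets.gsm8k.gsm8k_gen import gsm8k_datasets
    from .models.hf_llama.hf_llama3_70b_instruct import models as llama3_70b_instruct

# Evaluate the selected model on the selected benchmarks.
datasets = [*mmlu_datasets, *gsm8k_datasets]
models = [*llama3_70b_instruct]
```

Launching it with `python run.py configs/eval_corebench.py` produces a summary table whose Metric and Mode columns mirror the ones above, for example `naive_average` (an unweighted mean over a benchmark's subsets) for MMLU and `accuracy` for GSM-8K, both evaluated in generation (`gen`) mode.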