OpenCompass/opencompass/configs/datasets/ceval
Songyang Zhang 46cc7894e1
[Feature] Support import configs/models/summarizers from whl (#1376)
* [Feature] Support import configs/models/summarizers from whl

* Update LCBench configs

* Update

* Update

* Update

* Update

* update

* Update

* Update

* Update

* Update

* Update
2024-08-01 00:42:48 +08:00
..
ceval_clean_ppl.py [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00
ceval_gen_2daf24.py [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00
ceval_gen_5f30c7.py [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00
ceval_gen.py [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00
ceval_internal_ppl_1cd8bf.py [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00
ceval_internal_ppl_93e5ce.py [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00
ceval_ppl_1cd8bf.py [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00
ceval_ppl_93e5ce.py [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00
ceval_ppl_578f8d.py [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00
ceval_ppl.py [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00
ceval_zero_shot_gen_bd40ef.py [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00
README.md [Feature] Support import configs/models/summarizers from whl (#1376) 2024-08-01 00:42:48 +08:00

C-Eval

python3 run.py --models hf_internlm2_7b --datasets ceval_internal_ppl_93e5ce --debug
python3 run.py --models hf_internlm2_chat_7b --datasets ceval_internal_gen_2daf24 --debug

Base Models

model ceval-test ceval-test-hard ceval-test-stem ceval-test-social-science ceval-test-humanities ceval-test-other ceval-dev ceval-dev-hard ceval-dev-stem ceval-dev-social-science ceval-dev-humanities ceval-dev-other
llama-7b-turbomind 26.61 27.75 27.20 26.31 25.90 26.52 27.44 27.68 27.16 29.49 24.18 29.36
llama-13b-turbomind 29.18 25.59 27.66 33.86 28.29 28.58 31.75 30.32 31.39 35.22 30.16 30.82
llama-30b-turbomind 35.09 31.68 34.56 39.89 33.02 33.76 37.70 31.97 34.80 42.72 41.19 34.93
llama-65b-turbomind 37.98 29.47 36.03 45.03 36.51 36.56 40.46 33.76 36.37 46.47 42.26 40.63
llama-2-7b-turbomind 30.13 26.26 29.29 33.02 31.02 28.15 32.70 25.85 28.75 39.75 37.04 29.13
llama-2-13b-turbomind 37.38 30.81 35.85 43.98 36.81 34.75 40.43 31.34 35.67 45.75 45.32 39.36
llama-2-70b-turbomind 49.53 33.48 44.73 60.19 50.93 47.17 50.26 32.53 44.83 59.44 54.45 47.58
llama-3-8b-turbomind 48.83 34.47 46.02 56.48 49.15 46.69 50.45 33.76 45.94 58.08 50.93 51.25
llama-3-70b-turbomind 66.56 54.09 64.08 76.43 64.38 64.25 67.30 52.35 62.67 77.89 69.76 63.65
internlm2-1.8b-turbomind 44.79 33.93 41.19 54.26 47.15 40.35 46.64 33.00 38.62 57.28 51.30 46.89
internlm2-7b-turbomind 63.54 45.32 58.10 76.40 66.94 58.32 64.23 40.09 54.37 76.88 70.11 64.77
internlm2-20b-turbomind 67.28 50.15 62.33 79.59 70.55 61.82 66.73 42.50 59.25 79.98 73.43 61.56
qwen-1.8b-turbomind 54.24 38.60 50.02 68.18 55.33 48.13 53.78 33.38 46.36 68.40 57.57 50.17
qwen-7b-turbomind 62.06 42.73 56.21 77.12 65.28 55.76 63.23 36.99 54.74 78.55 68.94 59.02
qwen-14b-turbomind 70.33 53.61 65.25 83.19 72.85 65.37 72.05 55.03 66.07 85.59 74.91 67.78
qwen-72b-turbomind 83.25 66.78 78.44 91.75 83.86 83.63 83.60 63.68 78.05 90.25 87.13 84.13
qwen1.5-0.5b-hf 48.36 35.55 44.72 62.00 48.51 42.41 50.43 37.00 46.28 62.64 48.11 49.18
qwen1.5-1.8b-hf 58.67 40.98 53.91 74.52 58.51 53.06 59.38 43.02 53.45 75.88 60.06 54.47
qwen1.5-4b-hf 66.55 48.50 61.45 81.12 67.90 61.22 66.46 43.12 56.76 82.89 67.61 68.03
qwen1.5-7b-hf 72.49 52.90 66.77 85.50 74.37 69.19 73.57 49.16 66.32 84.23 77.30 73.34
qwen1.5-14b-hf 76.93 60.50 72.08 88.81 77.95 73.94 77.86 54.81 71.55 86.79 82.86 76.23
qwen1.5-32b-hf 82.50 66.67 77.97 90.93 83.66 81.88 82.79 71.06 80.01 89.02 83.36 81.62
qwen1.5-72b-hf 83.03 65.09 77.90 91.47 83.85 83.86 83.72 64.09 77.26 91.87 87.64 84.14
qwen1.5-moe-a2-7b-hf 76.67 51.37 68.89 88.33 77.15 79.73 77.90 51.25 67.27 89.28 83.16 81.60
mistral-7b-v0.1-hf 43.76 33.85 42.23 49.97 41.10 43.54 47.54 33.97 44.74 54.80 51.52 42.06
mistral-7b-v0.2-hf 42.81 32.84 41.00 50.19 39.45 42.77 46.44 31.67 42.89 54.50 48.75 43.23
mixtral-8x7b-v0.1-hf 51.15 41.46 50.93 59.19 46.69 48.72 55.31 42.04 52.78 62.00 56.44 52.71
mixtral-8x22b-v0.1-hf 58.13 48.31 58.01 66.94 53.60 54.86 60.50 45.67 57.44 71.27 61.31 55.47
yi-6b-hf 70.78 43.72 60.54 83.29 75.39 73.40 73.13 46.87 63.14 85.52 78.70 74.45
yi-34b-hf 80.93 58.51 73.48 89.24 83.65 84.18 81.62 56.95 71.64 89.73 87.49 86.53
deepseek-7b-base-hf 43.68 28.90 37.03 53.55 50.14 40.34 45.07 31.94 38.81 56.68 47.10 43.85
deepseek-67b-base-hf 66.66 44.25 57.89 79.02 72.36 65.66 66.65 38.62 56.65 79.56 73.72 66.01

Details on Test Split

model computer_network operating_system computer_architecture college_programming college_physics college_chemistry advanced_mathematics probability_and_statistics discrete_mathematics electrical_engineer metrology_engineer high_school_mathematics
llama-7b-turbomind 29.82 25.70 26.94 30.99 32.95 23.66 26.01 22.89 27.45 30.09 26.48 33.13
llama-13b-turbomind 33.33 37.99 31.09 29.82 22.16 27.23 31.79 27.11 24.84 28.02 33.33 30.72
llama-30b-turbomind 40.94 48.60 40.41 34.21 32.95 35.71 36.42 32.53 27.45 31.56 36.07 30.12
llama-65b-turbomind 41.52 50.84 44.04 40.94 27.84 29.46 28.32 30.72 29.41 35.10 42.47 30.12
llama-2-7b-turbomind 33.92 37.99 34.72 30.99 26.70 21.88 31.79 25.30 24.18 31.56 39.73 30.12
llama-2-13b-turbomind 40.94 46.93 37.82 36.26 30.68 29.46 35.84 30.72 24.84 32.74 42.92 34.94
llama-2-70b-turbomind 55.56 58.66 53.89 47.95 34.09 33.48 32.95 27.11 34.64 37.76 57.99 29.52
llama-3-8b-turbomind 55.56 58.66 55.96 51.17 27.27 35.27 36.42 31.33 34.64 40.12 50.68 30.72
llama-3-70b-turbomind 69.59 75.98 69.95 71.64 49.43 58.04 52.02 53.01 58.82 45.72 68.95 40.96
internlm2-1.8b-turbomind 40.35 40.78 39.38 32.16 34.66 34.38 31.21 31.33 35.95 35.10 51.60 27.71
internlm2-7b-turbomind 56.14 57.54 62.69 49.42 43.75 48.21 34.68 32.53 33.33 41.00 60.27 40.36
internlm2-20b-turbomind 62.57 65.36 66.84 58.77 43.18 51.79 39.31 40.36 35.95 42.77 66.67 47.59
qwen-1.8b-turbomind 46.20 41.90 46.63 36.84 40.34 36.61 27.75 28.92 32.68 36.58 57.08 30.12
qwen-7b-turbomind 52.63 54.75 54.40 46.20 35.80 44.20 36.99 27.71 26.80 38.35 57.99 33.13
qwen-14b-turbomind 58.48 64.80 59.07 54.68 45.45 57.59 45.09 33.73 39.22 49.26 67.58 45.78
qwen-72b-turbomind 83.04 73.74 79.27 76.61 75.00 64.29 49.13 44.58 46.41 66.37 85.84 68.07
qwen1.5-0.5b-hf 37.43 40.22 41.45 35.09 40.91 34.82 30.06 27.11 26.80 29.79 54.34 31.93
qwen1.5-1.8b-hf 47.37 50.84 47.67 38.30 43.18 35.27 29.48 30.12 33.99 39.53 58.90 28.92
qwen1.5-4b-hf 62.57 56.98 56.99 46.78 48.30 45.98 40.46 34.34 31.37 46.61 62.10 43.37
qwen1.5-7b-hf 66.08 62.57 66.32 55.56 54.55 47.77 41.62 31.93 35.95 49.85 74.43 49.40
qwen1.5-14b-hf 71.35 66.48 68.39 64.91 57.95 65.62 41.62 40.36 47.71 56.64 79.45 56.63
qwen1.5-32b-hf 84.80 73.18 74.61 70.18 71.59 61.61 49.13 45.78 49.02 61.95 87.67 72.89
qwen1.5-72b-hf 85.38 73.74 78.24 78.36 72.73 63.39 43.35 40.96 49.02 65.78 85.84 66.27
qwen1.5-moe-a2-7b-hf 77.78 73.74 68.91 64.91 66.48 49.11 33.53 36.75 35.95 61.06 91.32 40.96
mistral-7b-v0.1-hf 55.56 55.31 56.99 48.25 39.77 39.29 33.53 25.90 31.37 35.99 45.21 27.11
mistral-7b-v0.2-hf 56.14 53.63 55.44 47.66 36.36 34.38 32.37 25.30 33.33 31.86 45.21 29.52
mixtral-8x7b-v0.1-hf 62.57 64.80 60.10 60.53 38.64 42.41 40.46 37.35 45.75 35.99 60.27 34.94
mixtral-8x22b-v0.1-hf 65.50 74.86 63.73 65.79 46.59 52.68 52.02 45.78 52.94 42.77 62.56 39.16
yi-6b-hf 68.42 63.13 69.43 57.89 42.05 48.66 31.79 33.13 28.76 49.85 74.89 37.35
yi-34b-hf 83.63 80.45 74.09 68.42 62.50 60.27 45.09 38.55 50.33 65.19 88.58 49.40
deepseek-7b-base-hf 44.44 44.13 44.56 36.26 30.68 29.02 32.37 24.70 26.14 35.99 48.86 28.31
deepseek-67b-base-hf 63.16 70.39 65.80 59.36 42.61 45.54 35.84 38.55 42.48 44.54 68.95 33.73
model high_school_physics high_school_chemistry high_school_biology middle_school_mathematics middle_school_biology middle_school_physics middle_school_chemistry veterinary_medicine college_economics business_administration marxism mao_zedong_thought
llama-7b-turbomind 29.14 26.74 24.57 29.94 22.92 23.60 20.00 30.95 29.98 24.58 25.70 25.11
llama-13b-turbomind 22.29 18.60 28.00 26.55 26.56 25.28 19.46 29.05 28.77 28.57 39.66 43.38
llama-30b-turbomind 25.14 33.14 36.00 31.07 39.06 28.09 33.51 38.10 35.21 35.88 48.04 33.33
llama-65b-turbomind 33.71 26.16 38.29 33.90 44.27 36.52 38.92 38.10 37.42 42.19 59.22 48.40
llama-2-7b-turbomind 26.86 23.26 26.86 28.81 28.12 29.78 22.70 30.48 31.79 30.56 33.52 36.07
llama-2-13b-turbomind 28.00 31.98 36.57 36.72 38.54 36.52 37.84 46.67 37.02 36.54 57.54 41.10
llama-2-70b-turbomind 40.00 36.05 48.00 36.72 66.67 55.06 55.68 52.86 51.91 48.50 68.16 60.73
llama-3-8b-turbomind 41.71 38.37 50.86 36.16 61.98 63.48 63.78 56.19 41.65 49.17 69.27 54.34
llama-3-70b-turbomind 63.43 56.98 69.14 59.32 84.90 75.28 78.92 79.52 68.81 59.80 86.59 79.91
internlm2-1.8b-turbomind 30.29 45.93 46.29 33.33 63.02 60.11 62.70 47.62 35.61 37.87 69.27 61.64
internlm2-7b-turbomind 64.57 65.12 76.00 54.80 91.15 85.96 90.27 74.29 57.34 50.50 86.59 83.56
internlm2-20b-turbomind 68.57 74.42 78.86 58.76 91.67 90.45 90.27 72.38 57.95 55.81 88.83 88.58
qwen-1.8b-turbomind 55.43 56.98 61.14 54.80 85.42 84.83 85.41 54.76 43.06 44.19 83.80 79.91
qwen-7b-turbomind 68.00 69.19 82.86 57.63 93.75 87.64 92.43 63.81 47.28 57.48 86.59 82.65
qwen-14b-turbomind 78.86 83.14 92.57 67.23 96.88 95.51 96.76 73.33 56.94 64.45 91.62 86.76
qwen-72b-turbomind 93.14 93.60 95.43 88.70 98.44 97.75 99.46 90.00 75.45 80.73 96.09 99.54
qwen1.5-0.5b-hf 48.57 44.19 60.00 40.68 73.44 69.66 78.92 49.05 34.41 40.20 79.89 74.43
qwen1.5-1.8b-hf 58.86 68.02 76.00 59.32 91.15 90.45 87.03 63.81 44.87 48.50 86.03 90.41
qwen1.5-4b-hf 66.86 77.33 82.86 68.93 95.31 92.70 97.30 71.90 51.31 61.13 91.62 94.52
qwen1.5-7b-hf 79.43 82.56 91.43 77.40 96.88 95.51 96.22 80.00 62.37 69.77 93.30 97.26
qwen1.5-14b-hf 86.29 87.79 93.14 83.05 97.92 95.51 97.84 82.86 63.78 77.08 95.53 96.35
qwen1.5-32b-hf 88.00 95.35 94.86 91.53 97.92 99.44 100.00 90.00 73.44 78.74 94.97 98.63
qwen1.5-72b-hf 91.43 93.60 95.43 88.70 97.92 98.31 99.46 90.00 74.25 80.40 94.41 98.63
qwen1.5-moe-a2-7b-hf 70.86 77.33 82.86 68.36 97.92 93.26 97.30 89.52 70.22 74.75 96.09 98.17
mistral-7b-v0.1-hf 33.14 40.70 40.57 40.11 47.92 49.44 50.81 47.62 44.87 37.87 58.10 48.40
mistral-7b-v0.2-hf 34.86 36.63 45.71 36.72 46.35 46.07 48.65 43.81 43.46 39.53 57.54 48.86
mixtral-8x7b-v0.1-hf 49.71 42.44 53.71 47.46 62.50 61.24 60.00 57.62 52.52 44.52 68.72 57.99
mixtral-8x22b-v0.1-hf 54.29 43.02 58.29 55.93 76.04 66.29 75.68 66.19 60.97 51.83 74.30 70.78
yi-6b-hf 58.86 69.19 78.29 43.50 92.19 89.33 90.27 83.81 59.56 70.10 93.85 97.72
yi-34b-hf 80.00 81.98 93.14 65.54 97.40 95.51 96.76 92.86 74.04 76.08 94.97 97.26
deepseek-7b-base-hf 29.14 30.81 33.14 24.29 53.12 45.51 48.65 50.48 38.23 44.19 62.01 65.30
deepseek-67b-base-hf 60.00 55.23 64.00 46.33 84.90 79.78 83.24 73.33 57.75 63.79 89.94 88.58
model education_science teacher_qualification high_school_politics high_school_geography middle_school_politics middle_school_geography modern_chinese_history ideological_and_moral_cultivation logic law chinese_language_and_literature art_studies
llama-7b-turbomind 22.96 31.58 25.57 29.78 22.80 25.00 21.70 21.51 25.00 26.24 22.49 25.84
llama-13b-turbomind 29.26 30.83 33.52 36.52 34.72 33.33 24.06 40.12 26.47 33.48 30.14 29.87
llama-30b-turbomind 37.41 46.37 32.95 38.20 50.78 40.74 28.77 45.93 33.33 32.13 39.23 22.82
llama-65b-turbomind 39.63 51.13 31.82 39.89 58.03 42.59 34.91 55.23 39.71 30.32 37.80 32.89
llama-2-7b-turbomind 27.78 34.34 31.82 34.83 35.23 34.26 28.77 38.95 32.35 33.94 27.27 30.87
llama-2-13b-turbomind 41.48 47.37 37.50 37.64 50.78 52.78 43.40 48.84 32.35 38.46 36.36 30.20
llama-2-70b-turbomind 57.78 69.17 50.57 58.43 69.95 66.67 50.94 72.09 50.98 42.53 44.98 52.01
llama-3-8b-turbomind 56.30 65.41 47.16 56.18 64.25 61.11 55.66 67.44 41.67 40.27 45.45 50.34
llama-3-70b-turbomind 72.22 85.46 75.00 74.72 84.97 76.85 75.00 76.16 59.31 52.94 62.68 68.46
internlm2-1.8b-turbomind 47.41 61.40 55.11 47.75 61.66 64.81 61.79 63.95 32.35 32.58 48.33 36.58
internlm2-7b-turbomind 66.67 85.96 78.98 74.72 91.71 87.96 80.66 80.23 42.16 50.23 64.11 70.13
internlm2-20b-turbomind 69.26 89.22 83.52 80.34 90.67 91.67 83.02 85.47 49.02 54.30 72.25 73.15
qwen-1.8b-turbomind 51.11 70.68 71.02 62.36 88.60 87.04 69.81 73.26 29.90 46.15 50.24 47.32
qwen-7b-turbomind 57.41 83.71 88.64 79.78 93.26 94.44 75.47 79.07 42.16 47.96 59.33 65.10
qwen-14b-turbomind 72.96 89.97 93.75 83.71 96.37 95.37 86.32 87.21 50.00 60.63 66.99 72.48
qwen-72b-turbomind 85.56 96.24 95.45 93.26 97.93 97.22 92.45 91.86 67.65 76.92 75.12 83.89
qwen1.5-0.5b-hf 43.33 63.16 65.91 56.18 82.90 79.63 68.87 70.35 28.43 37.56 39.23 32.21
qwen1.5-1.8b-hf 57.41 76.44 81.25 75.84 92.75 91.67 79.72 81.98 34.31 47.96 47.85 43.62
qwen1.5-4b-hf 65.93 87.47 86.93 82.58 94.30 95.37 84.91 84.30 40.20 62.90 58.85 58.72
qwen1.5-7b-hf 69.26 91.98 90.91 89.89 95.85 94.44 89.15 87.21 48.04 67.87 63.16 68.12
qwen1.5-14b-hf 78.89 94.99 94.89 91.57 96.89 98.15 91.04 88.37 57.84 69.68 66.99 73.83
qwen1.5-32b-hf 83.70 95.99 93.75 94.38 98.45 97.22 90.57 91.28 70.10 76.92 76.56 80.87
qwen1.5-72b-hf 84.44 96.49 96.59 93.82 98.45 97.22 92.92 91.28 66.67 76.92 74.16 85.23
qwen1.5-moe-a2-7b-hf 80.74 95.49 89.20 89.33 94.82 94.44 92.45 91.28 52.45 75.57 67.94 79.87
mistral-7b-v0.1-hf 45.19 59.15 43.75 49.44 56.48 56.48 45.28 58.14 37.75 38.91 40.67 34.56
mistral-7b-v0.2-hf 45.93 58.65 38.07 48.31 63.21 58.33 41.98 54.07 35.78 40.27 38.28 32.21
mixtral-8x7b-v0.1-hf 57.04 67.92 53.41 55.06 69.95 64.81 47.64 70.93 42.16 38.01 46.41 36.58
mixtral-8x22b-v0.1-hf 60.37 72.68 64.77 65.17 77.20 71.30 57.08 75.00 49.51 43.44 52.63 49.33
yi-6b-hf 79.26 92.48 77.27 76.40 92.75 93.52 89.15 90.12 60.78 74.66 61.24 74.16
yi-34b-hf 84.81 96.24 88.07 88.20 96.37 96.30 91.98 91.28 75.00 78.73 80.38 82.89
deepseek-7b-base-hf 52.22 70.18 47.16 51.12 60.62 44.44 58.49 66.86 31.86 37.56 53.11 61.07
deepseek-67b-base-hf 76.67 89.22 77.27 78.65 89.64 78.70 85.85 84.30 50.00 64.25 69.38 84.23
model professional_tour_guide legal_professional high_school_chinese high_school_history middle_school_history civil_servant sports_science plant_protection basic_medicine clinical_medicine urban_and_rural_planner accountant
llama-7b-turbomind 29.70 23.72 27.53 30.22 30.92 27.04 22.78 28.64 28.00 25.00 26.32 29.80
llama-13b-turbomind 25.94 20.93 25.84 29.67 24.64 29.60 26.67 29.15 33.71 25.50 28.47 28.44
llama-30b-turbomind 29.32 27.91 30.34 36.26 37.20 36.13 36.11 38.69 34.29 29.50 38.52 29.35
llama-65b-turbomind 28.95 30.70 30.90 44.51 35.75 36.60 45.56 39.20 37.71 30.00 39.47 37.02
llama-2-7b-turbomind 29.70 30.23 24.72 29.67 34.78 30.07 31.11 31.16 30.29 25.50 31.34 27.31
llama-2-13b-turbomind 30.83 32.56 24.16 42.31 45.41 32.87 36.67 45.23 38.29 33.50 35.17 34.31
llama-2-70b-turbomind 53.76 38.14 30.34 58.79 65.70 43.82 51.11 58.29 49.71 42.00 49.76 46.28
llama-3-8b-turbomind 52.63 42.33 27.53 51.65 65.70 44.52 54.44 51.26 46.86 43.00 46.41 45.15
llama-3-70b-turbomind 72.93 52.56 32.58 71.98 83.57 56.88 69.44 78.89 76.00 67.50 57.89 59.14
internlm2-1.8b-turbomind 51.50 38.14 25.84 56.04 71.50 47.32 35.00 43.72 42.29 39.00 41.15 36.57
internlm2-7b-turbomind 72.56 53.49 52.25 79.67 90.82 62.00 62.78 64.32 66.86 59.50 55.74 53.50
internlm2-20b-turbomind 74.06 54.42 56.18 81.87 92.27 61.77 68.33 69.85 68.00 63.50 60.77 58.92
qwen-1.8b-turbomind 54.14 43.72 39.89 69.23 85.02 49.88 45.56 48.74 48.57 51.50 46.89 45.82
qwen-7b-turbomind 71.05 48.37 53.93 81.87 93.72 59.67 54.44 62.31 58.29 57.50 50.24 56.66
qwen-14b-turbomind 79.70 53.02 63.48 87.36 94.20 71.33 63.33 71.36 73.14 68.00 59.09 67.95
qwen-72b-turbomind 90.23 77.21 79.21 91.76 96.14 77.86 86.11 85.43 91.43 90.50 76.08 86.68
qwen1.5-0.5b-hf 44.36 36.74 39.33 58.24 78.26 43.36 40.00 45.23 41.71 42.50 43.54 43.12
qwen1.5-1.8b-hf 59.40 47.91 37.08 72.53 91.30 53.61 53.33 51.26 49.71 58.00 51.20 56.21
qwen1.5-4b-hf 65.04 58.60 55.62 83.52 94.20 62.00 63.89 65.33 65.71 64.00 55.26 61.40
qwen1.5-7b-hf 78.57 66.51 66.85 87.91 94.69 68.07 65.00 64.82 77.14 77.50 60.77 74.49
qwen1.5-14b-hf 83.08 72.09 70.22 90.11 94.20 69.46 73.89 70.35 82.29 83.00 65.31 78.33
qwen1.5-32b-hf 87.59 78.14 79.78 92.86 95.65 78.32 80.56 79.90 90.29 89.00 77.27 86.68
qwen1.5-72b-hf 91.35 76.74 79.21 91.76 96.14 79.25 85.56 86.93 92.00 90.00 75.84 86.91
qwen1.5-moe-a2-7b-hf 88.35 75.81 51.12 79.12 94.69 67.37 80.56 73.37 87.43 84.00 78.23 82.39
mistral-7b-v0.1-hf 40.23 39.07 24.16 41.21 52.17 41.49 45.00 52.26 45.14 42.00 42.58 44.02
mistral-7b-v0.2-hf 36.84 34.88 23.03 43.96 52.66 40.79 50.00 50.75 45.14 40.50 42.58 40.86
mixtral-8x7b-v0.1-hf 47.74 40.00 28.09 57.14 58.94 44.29 58.33 53.77 48.57 46.00 51.20 46.50
mixtral-8x22b-v0.1-hf 59.02 41.86 29.78 60.99 71.01 50.82 57.78 67.34 62.29 52.00 53.35 55.98
yi-6b-hf 85.34 67.91 53.93 80.22 91.79 65.97 72.22 72.36 82.29 84.50 69.86 71.56
yi-34b-hf 94.36 76.74 65.73 87.91 95.17 79.25 85.56 90.95 90.86 92.00 76.79 82.39
deepseek-7b-base-hf 65.79 29.30 32.58 47.80 67.15 37.76 44.44 52.26 43.43 36.50 41.15 37.02
deepseek-67b-base-hf 83.83 58.60 45.51 79.67 90.34 62.47 70.56 70.85 81.14 71.50 61.72 60.05
model fire_engineer environmental_impact_assessment_engineer tax_accountant physician
llama-7b-turbomind 22.34 24.91 29.12 27.77
llama-13b-turbomind 24.11 30.25 27.77 30.70
llama-30b-turbomind 28.72 31.67 31.83 36.57
llama-65b-turbomind 28.37 39.15 33.63 35.44
llama-2-7b-turbomind 22.70 24.91 25.51 29.80
llama-2-13b-turbomind 25.53 35.94 29.35 35.44
llama-2-70b-turbomind 36.52 52.67 36.12 52.60
llama-3-8b-turbomind 35.46 49.82 41.31 55.30
llama-3-70b-turbomind 48.58 64.41 52.60 75.40
internlm2-1.8b-turbomind 32.27 42.35 39.05 45.15
internlm2-7b-turbomind 46.81 55.16 47.63 67.27
internlm2-20b-turbomind 45.04 62.63 51.47 69.75
qwen-1.8b-turbomind 41.84 47.69 45.60 57.34
qwen-7b-turbomind 41.84 54.80 48.08 69.53
qwen-14b-turbomind 45.74 64.77 56.43 77.88
qwen-72b-turbomind 80.50 74.73 81.04 89.62
qwen1.5-0.5b-hf 39.36 41.28 38.37 48.08
qwen1.5-1.8b-hf 45.74 49.47 51.69 63.43
qwen1.5-4b-hf 50.35 51.60 58.69 75.17
qwen1.5-7b-hf 58.51 65.84 67.04 81.94
qwen1.5-14b-hf 63.83 67.26 72.23 87.36
qwen1.5-32b-hf 74.47 73.31 80.14 90.74
qwen1.5-72b-hf 79.79 75.09 81.04 90.07
qwen1.5-moe-a2-7b-hf 74.82 77.58 79.68 91.65
mistral-7b-v0.1-hf 32.27 45.91 37.70 50.56
mistral-7b-v0.2-hf 32.62 44.13 36.79 46.28
mixtral-8x7b-v0.1-hf 35.11 53.02 46.73 52.37
mixtral-8x22b-v0.1-hf 38.65 56.23 49.21 59.82
yi-6b-hf 67.38 68.68 69.53 83.07
yi-34b-hf 77.66 83.27 77.43 89.84
deepseek-7b-base-hf 30.50 38.79 35.67 46.28
deepseek-67b-base-hf 46.81 65.12 54.40 77.65

Details on Dev Split

Chat Models

model ceval-test ceval-test-hard ceval-test-stem ceval-test-social-science ceval-test-humanities ceval-test-other ceval-dev ceval-dev-hard ceval-dev-stem ceval-dev-social-science ceval-dev-humanities ceval-dev-other
qwen1.5-0.5b-chat-hf 36.88 28.83 34.49 43.46 37.35 34.76 38.58 33.90 33.63 43.81 41.79 39.59
qwen1.5-1.8b-chat-hf 55.17 38.21 50.63 70.26 56.04 48.82 55.93 37.60 50.31 67.59 60.90 50.59
qwen1.5-4b-chat-hf 61.54 44.79 56.86 75.84 62.13 56.46 62.76 38.32 55.39 79.53 65.67 58.00
qwen1.5-7b-chat-hf 68.71 51.77 64.27 81.23 68.22 65.88 71.10 50.13 65.42 83.99 73.77 67.02
qwen1.5-14b-chat-hf 74.80 56.54 69.46 87.47 76.46 71.32 76.35 52.08 69.68 86.70 80.56 74.87
qwen1.5-32b-chat-hf 80.47 63.17 75.66 89.58 81.98 79.43 81.27 63.51 76.64 89.39 82.97 80.59
qwen1.5-72b-chat-hf 81.53 63.62 75.86 90.74 83.18 81.84 82.88 62.44 77.54 89.80 86.11 83.07
qwen1.5-110b-chat-hf 87.33 67.27 80.70 93.58 89.67 91.35 87.59 73.64 81.94 91.47 92.12 89.80
internlm2-chat-1.8b-hf 47.04 34.81 43.28 59.34 48.24 41.50 48.51 36.75 42.23 57.79 54.83 45.15
internlm2-chat-1.8b-sft-hf 47.19 35.34 43.49 59.56 48.30 41.58 48.75 35.83 42.04 59.80 54.84 44.83
internlm2-chat-7b-hf 58.75 39.61 52.38 71.46 61.57 55.96 61.04 36.56 51.81 74.01 69.13 57.92
internlm2-chat-7b-sft-hf 58.96 40.09 52.40 71.49 62.20 56.26 61.02 37.29 52.60 74.01 68.27 57.27
internlm2-chat-20b-hf 63.12 42.65 56.21 75.64 67.15 60.27 63.45 34.96 52.84 79.27 71.50 60.32
internlm2-chat-20b-sft-hf 63.16 42.70 56.19 75.74 67.20 60.37 63.54 34.96 52.57 80.33 71.42 60.34
llama-3-8b-instruct-hf 50.90 34.54 46.73 58.73 49.24 53.04 52.55 36.37 48.47 58.03 53.26 54.26
llama-3-70b-instruct-hf 67.38 54.02 65.16 76.83 62.29 67.92 67.92 54.50 66.85 76.80 65.98 63.72
llama-3-8b-instruct-lmdeploy 49.92 34.75 46.19 58.49 47.68 51.14 50.27 33.32 46.25 56.93 49.02 52.76
llama-3-70b-instruct-lmdeploy 66.41 52.76 64.72 75.31 61.36 66.44 68.21 52.28 65.86 75.06 68.37 66.09
mistral-7b-instruct-v0.1-hf 36.76 27.76 35.55 42.41 34.45 36.12 40.04 30.21 35.77 45.15 40.99 42.22
mistral-7b-instruct-v0.2-hf 40.38 30.26 38.82 47.66 37.08 39.91 43.00 25.97 38.60 47.44 48.15 41.82
mixtral-8x7b-instruct-v0.1-hf 49.61 37.78 47.86 58.56 46.40 47.85 51.68 37.41 49.14 59.79 52.97 47.65

Details on Test Split

model computer_network operating_system computer_architecture college_programming college_physics college_chemistry advanced_mathematics probability_and_statistics discrete_mathematics electrical_engineer metrology_engineer high_school_mathematics
qwen1.5-0.5b-chat-hf 35.67 36.87 33.68 33.92 35.23 28.12 27.17 26.51 24.84 28.91 40.18 25.90
qwen1.5-1.8b-chat-hf 46.78 47.49 50.78 39.18 41.48 31.25 32.95 27.71 28.10 34.81 55.71 27.11
qwen1.5-4b-chat-hf 54.39 54.75 54.92 44.74 46.02 43.30 39.31 31.33 28.10 45.13 58.90 43.98
qwen1.5-7b-chat-hf 60.82 60.34 63.21 55.85 48.86 45.09 46.24 36.14 39.22 47.49 70.32 45.78
qwen1.5-14b-chat-hf 69.59 62.57 64.77 64.91 55.68 57.14 49.13 32.53 43.14 55.16 76.71 46.99
qwen1.5-32b-chat-hf 81.87 74.30 73.58 71.35 63.07 60.71 50.87 46.99 47.06 59.29 83.11 60.84
qwen1.5-72b-chat-hf 77.78 75.42 76.17 73.39 63.64 62.50 45.09 45.78 48.37 59.00 81.74 60.84
qwen1.5-110b-chat-hf 83.63 86.03 81.87 77.49 76.70 67.86 49.13 47.59 55.56 79.94 95.89 62.05
internlm2-chat-1.8b-hf 42.11 43.58 44.56 35.38 32.95 34.82 32.95 28.92 32.68 34.22 53.42 31.93
internlm2-chat-1.8b-sft-hf 42.11 44.13 43.01 35.09 34.09 36.16 32.95 27.11 33.33 35.10 51.14 33.13
internlm2-chat-7b-hf 59.65 60.89 58.03 51.46 36.93 43.75 36.99 29.52 36.60 39.82 63.47 38.55
internlm2-chat-7b-sft-hf 59.06 61.45 56.48 52.63 39.77 41.52 36.99 27.71 39.22 40.12 62.10 40.36
internlm2-chat-20b-hf 61.99 70.39 63.73 54.97 33.52 47.77 43.93 40.96 44.44 44.25 61.64 34.34
internlm2-chat-20b-sft-hf 61.40 70.39 63.21 54.97 32.95 47.77 42.20 42.17 43.14 44.25 61.64 32.53
llama-3-8b-instruct-hf 57.31 58.10 57.51 51.17 28.41 35.27 39.31 32.53 35.29 38.05 55.25 27.11
llama-3-70b-instruct-hf 71.93 74.86 70.98 67.54 50.57 57.14 52.60 53.01 56.21 47.79 68.95 43.98
llama-3-8b-instruct-lmdeploy 55.56 57.54 55.44 48.25 30.11 33.04 35.84 31.33 33.33 38.94 53.88 31.93
llama-3-70b-instruct-lmdeploy 70.76 77.09 69.95 67.84 49.43 54.02 50.87 54.22 56.21 47.20 69.86 42.17
mistral-7b-instruct-v0.1-hf 49.12 47.49 43.52 39.18 32.39 28.57 29.48 24.10 28.10 37.46 44.29 23.49
mistral-7b-instruct-v0.2-hf 47.95 53.07 52.85 42.69 28.41 26.79 40.46 30.12 29.41 33.33 42.92 24.10
mixtral-8x7b-instruct-v0.1-hf 58.48 62.57 58.03 56.43 38.64 36.16 39.31 34.94 37.91 34.81 55.71 28.31
model high_school_physics high_school_chemistry high_school_biology middle_school_mathematics middle_school_biology middle_school_physics middle_school_chemistry veterinary_medicine college_economics business_administration marxism mao_zedong_thought
qwen1.5-0.5b-chat-hf 30.86 31.98 44.00 27.68 47.40 40.45 55.14 35.24 32.80 30.56 58.66 57.53
qwen1.5-1.8b-chat-hf 54.86 62.21 69.14 53.67 82.81 83.15 85.41 58.10 44.06 49.83 82.12 82.65
qwen1.5-4b-chat-hf 58.86 67.44 80.00 55.93 89.58 88.20 88.11 64.29 47.08 57.48 86.59 84.93
qwen1.5-7b-chat-hf 72.00 80.81 84.00 70.06 95.31 94.94 95.14 73.81 56.94 66.11 91.62 89.04
qwen1.5-14b-chat-hf 84.00 83.72 90.29 80.23 97.92 94.94 98.38 81.43 63.18 74.75 93.30 96.80
qwen1.5-32b-chat-hf 85.71 90.12 93.71 85.31 97.92 98.31 100.00 89.05 69.82 75.75 93.85 97.72
qwen1.5-72b-chat-hf 88.57 94.19 94.86 85.31 97.92 97.75 98.38 90.48 71.63 79.73 93.85 97.72
qwen1.5-110b-chat-hf 86.86 92.44 94.29 85.31 98.44 98.88 98.92 95.24 78.87 86.38 95.53 99.54
internlm2-chat-1.8b-hf 35.43 48.84 52.00 35.03 70.31 67.98 67.03 41.43 37.83 36.88 70.95 60.73
internlm2-chat-1.8b-sft-hf 37.71 48.26 53.14 34.46 71.35 67.98 67.57 41.90 38.63 37.54 72.63 60.27
internlm2-chat-7b-hf 46.29 48.26 60.57 46.89 78.65 71.91 71.35 68.10 50.30 50.83 77.09 76.26
internlm2-chat-7b-sft-hf 46.86 48.26 61.14 45.76 77.60 71.91 71.35 67.62 50.10 50.50 77.09 75.80
internlm2-chat-20b-hf 49.71 46.51 63.43 55.37 80.73 74.72 79.46 72.38 55.73 59.80 85.47 76.26
internlm2-chat-20b-sft-hf 53.71 47.09 64.00 55.37 80.73 73.60 78.92 73.81 55.53 60.13 85.47 75.80
llama-3-8b-instruct-hf 38.86 39.53 50.29 40.11 65.10 60.11 63.78 61.43 47.89 45.85 69.27 56.16
llama-3-70b-instruct-hf 63.43 55.23 69.71 68.36 85.42 80.90 78.38 86.19 69.01 65.12 83.24 82.65
llama-3-8b-instruct-lmdeploy 41.71 40.70 52.00 41.24 61.46 58.43 65.41 57.62 45.27 46.18 69.27 55.71
llama-3-70b-instruct-lmdeploy 61.71 53.49 70.86 64.97 88.02 83.71 77.30 84.76 68.21 60.80 80.45 79.91
mistral-7b-instruct-v0.1-hf 27.43 28.49 36.00 28.25 40.10 42.70 43.78 37.14 32.80 37.87 41.90 48.86
mistral-7b-instruct-v0.2-hf 33.14 29.65 44.00 31.07 47.92 44.94 49.19 44.29 37.02 40.86 53.63 48.40
mixtral-8x7b-instruct-v0.1-hf 46.29 40.70 54.86 42.37 58.85 60.67 57.84 54.29 50.10 46.51 69.27 52.51
model education_science teacher_qualification high_school_politics high_school_geography middle_school_politics middle_school_geography modern_chinese_history ideological_and_moral_cultivation logic law chinese_language_and_literature art_studies
qwen1.5-0.5b-chat-hf 33.33 46.12 37.50 37.08 57.51 43.52 42.45 51.74 32.84 31.22 37.32 24.50
qwen1.5-1.8b-chat-hf 54.07 72.43 74.43 66.85 89.12 87.04 77.36 76.16 38.24 44.34 46.89 40.94
qwen1.5-4b-chat-hf 60.00 84.71 82.39 69.66 94.82 90.74 79.72 78.49 41.67 57.47 54.07 56.38
qwen1.5-7b-chat-hf 66.30 90.73 84.66 80.90 94.30 91.67 82.55 84.88 38.73 60.18 60.77 63.42
qwen1.5-14b-chat-hf 74.81 93.73 90.91 92.13 96.89 98.15 89.62 88.37 54.41 70.14 69.86 69.13
qwen1.5-32b-chat-hf 80.37 94.49 93.75 94.94 97.93 97.22 90.09 90.70 68.63 78.73 73.21 77.52
qwen1.5-72b-chat-hf 84.07 96.74 95.45 94.94 97.93 95.37 92.92 91.28 63.73 80.09 73.68 83.89
qwen1.5-110b-chat-hf 90.37 96.99 96.02 95.51 98.45 98.15 93.87 94.19 81.37 86.88 84.69 90.94
internlm2-chat-1.8b-hf 48.15 65.41 69.32 54.49 79.27 70.37 60.85 64.53 32.35 32.58 45.45 40.60
internlm2-chat-1.8b-sft-hf 48.15 64.91 69.89 53.93 79.27 70.37 61.32 63.95 33.82 29.86 45.45 39.93
internlm2-chat-7b-hf 66.67 85.21 73.30 66.85 91.19 76.85 70.28 75.58 42.16 50.68 60.77 70.47
internlm2-chat-7b-sft-hf 67.04 85.21 73.86 66.85 90.67 77.78 71.70 75.00 42.16 51.13 60.29 72.15
internlm2-chat-20b-hf 74.07 85.96 75.57 77.53 89.12 76.85 72.64 83.72 51.96 56.11 68.42 73.49
internlm2-chat-20b-sft-hf 73.70 85.46 76.70 78.09 89.64 76.85 72.17 84.88 50.00 56.56 66.99 75.17
llama-3-8b-instruct-hf 55.93 67.42 55.68 55.06 72.02 62.04 54.25 66.86 44.12 40.72 47.37 44.63
llama-3-70b-instruct-hf 71.11 84.21 74.43 73.03 84.97 80.56 69.81 78.49 57.35 50.68 57.89 64.43
llama-3-8b-instruct-lmdeploy 54.81 67.17 58.52 53.37 72.54 62.04 57.08 63.95 44.12 37.56 46.89 42.62
llama-3-70b-instruct-lmdeploy 70.37 82.96 72.16 71.91 83.94 82.41 69.34 77.91 55.39 50.68 56.46 64.09
mistral-7b-instruct-v0.1-hf 39.63 46.62 33.52 41.01 56.48 45.37 36.32 43.60 29.90 31.67 39.71 31.88
mistral-7b-instruct-v0.2-hf 46.30 54.39 39.20 43.26 61.66 51.85 35.38 55.23 28.92 35.29 37.80 29.19
mixtral-8x7b-instruct-v0.1-hf 58.52 66.17 56.82 57.30 66.32 62.04 48.11 66.28 41.67 37.10 46.41 35.91
model professional_tour_guide legal_professional high_school_chinese high_school_history middle_school_history civil_servant sports_science plant_protection basic_medicine clinical_medicine urban_and_rural_planner accountant
qwen1.5-0.5b-chat-hf 36.47 39.07 27.53 41.76 45.89 39.63 35.56 31.66 37.71 34.00 32.78 37.25
qwen1.5-1.8b-chat-hf 56.02 45.58 39.33 67.03 84.54 49.42 48.89 51.76 47.43 50.50 45.69 52.14
qwen1.5-4b-chat-hf 61.28 52.56 42.70 73.08 85.99 55.48 59.44 55.28 60.57 57.00 50.00 58.01
qwen1.5-7b-chat-hf 73.31 56.28 58.99 82.97 88.41 64.57 66.67 63.82 77.14 75.50 57.42 69.07
qwen1.5-14b-chat-hf 80.83 65.12 70.79 89.56 93.24 67.60 72.78 68.34 80.57 80.00 61.72 75.62
qwen1.5-32b-chat-hf 87.59 72.56 76.40 90.66 95.65 74.36 80.00 80.40 86.86 84.00 74.88 85.33
qwen1.5-72b-chat-hf 90.98 76.28 75.84 90.66 95.65 75.52 84.44 82.91 91.43 89.00 73.92 85.10
qwen1.5-110b-chat-hf 95.11 88.37 82.58 91.76 96.62 87.65 91.67 90.95 93.71 95.00 87.08 91.87
internlm2-chat-1.8b-hf 54.14 40.00 27.53 62.09 70.53 44.99 41.67 51.76 45.71 39.00 40.67 39.28
internlm2-chat-1.8b-sft-hf 54.14 42.33 26.97 61.54 71.98 45.45 41.67 50.25 45.14 37.50 41.39 40.63
internlm2-chat-7b-hf 70.68 44.19 34.83 73.63 84.06 51.98 57.22 68.34 66.86 57.50 54.55 50.11
internlm2-chat-7b-sft-hf 71.80 44.65 37.64 73.63 84.06 51.98 57.78 67.84 65.71 60.50 54.55 50.11
internlm2-chat-20b-hf 75.56 54.42 42.13 74.73 85.51 57.34 65.56 67.84 73.71 64.00 57.89 55.98
internlm2-chat-20b-sft-hf 76.32 55.35 41.01 75.27 85.51 58.28 65.56 67.34 72.57 65.00 58.37 56.43
llama-3-8b-instruct-hf 53.01 44.65 33.15 46.70 66.18 45.22 58.89 61.81 62.86 57.50 48.33 49.89
llama-3-70b-instruct-hf 71.43 50.70 30.90 71.43 82.13 59.67 73.33 73.37 82.86 82.00 59.09 62.08
llama-3-8b-instruct-lmdeploy 51.13 45.12 29.78 43.96 62.32 47.09 56.11 54.77 56.00 56.00 49.04 47.40
llama-3-70b-instruct-lmdeploy 68.80 48.84 30.90 70.88 81.64 58.28 72.22 70.85 80.00 81.00 57.66 62.53
mistral-7b-instruct-v0.1-hf 30.45 35.81 24.72 40.11 34.78 30.77 43.89 38.69 36.57 32.50 44.74 34.09
mistral-7b-instruct-v0.2-hf 36.09 38.14 23.03 43.41 45.41 35.90 50.00 41.71 42.86 36.00 45.22 42.21
mixtral-8x7b-instruct-v0.1-hf 47.37 44.65 30.34 51.65 60.87 42.19 53.89 58.29 52.00 47.00 48.56 44.02
model fire_engineer environmental_impact_assessment_engineer tax_accountant physician
qwen1.5-0.5b-chat-hf 27.66 38.43 32.28 35.44
qwen1.5-1.8b-chat-hf 38.65 46.62 46.73 59.14
qwen1.5-4b-chat-hf 49.29 54.80 51.02 70.20
qwen1.5-7b-chat-hf 53.90 62.28 57.79 76.52
qwen1.5-14b-chat-hf 58.87 65.12 67.27 86.68
qwen1.5-32b-chat-hf 74.11 70.82 74.94 88.04
qwen1.5-72b-chat-hf 74.82 75.09 78.56 89.39
qwen1.5-110b-chat-hf 88.30 88.97 94.13 95.49
internlm2-chat-1.8b-hf 30.14 41.99 34.54 46.73
internlm2-chat-1.8b-sft-hf 30.14 43.06 34.31 47.86
internlm2-chat-7b-hf 42.20 52.31 47.63 66.82
internlm2-chat-7b-sft-hf 43.26 52.67 47.86 66.59
internlm2-chat-20b-hf 45.74 54.80 51.02 69.07
internlm2-chat-20b-sft-hf 45.74 55.16 51.02 68.62
llama-3-8b-instruct-hf 37.59 50.53 42.44 68.40
llama-3-70b-instruct-hf 50.71 64.06 55.53 84.42
llama-3-8b-instruct-lmdeploy 37.94 50.53 41.53 66.14
llama-3-70b-instruct-lmdeploy 48.94 63.70 53.95 81.72
mistral-7b-instruct-v0.1-hf 27.66 39.15 29.35 39.95
mistral-7b-instruct-v0.2-hf 32.27 37.01 32.96 42.89
mixtral-8x7b-instruct-v0.1-hf 36.88 48.75 41.76 53.05

Details on Dev Split