Merge branch 'main' into qwq32b

Linchen Xiao 2025-03-24 11:30:28 +08:00 committed by GitHub
commit d8b056cd77
50 changed files with 3380 additions and 56 deletions

View File

@ -57,6 +57,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2025.03.11\]** We now support evaluation on `SuperGPQA`, a benchmark covering 285 graduate-level disciplines for measuring LLM knowledge. 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series of models; please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model which has enhanced performance on reasoning and knowledge-intensive tasks.

View File

@ -57,6 +57,7 @@
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2025.03.11\]** We now support `SuperGPQA`, a knowledge benchmark covering 285 graduate-level disciplines. Give it a try! 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series of models; please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluation and `MATHEvaluator` for mathematical reasoning assessment. See the [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) docs for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model, which achieves the best performance among models of comparable size on reasoning and knowledge-intensive tasks. Give it a try!

View File

@ -234,6 +234,11 @@
category: Reasoning
paper: https://arxiv.org/pdf/2210.09261
configpath: opencompass/configs/datasets/bbh
- bbeh:
name: BIG-Bench Extra Hard
category: Reasoning
paper: https://arxiv.org/abs/2502.19187
configpath: opencompass/configs/datasets/bbeh
- BoolQ:
name: SuperGLUE / BoolQ
category: Knowledge
@ -524,6 +529,11 @@
category: Understanding
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_MultiRC
- multipl_e:
name: MultiPL-E
category: Code
paper: https://arxiv.org/pdf/2210.14868
configpath: opencompass/configs/datasets/multipl_e
- narrativeqa:
name: NarrativeQA
category: Understanding
@ -734,6 +744,8 @@
category: Understanding
paper: https://arxiv.org/pdf/1808.08745
configpath: opencompass/configs/datasets/Xsum
- supergpqa:
name: SuperGPQA
category: Knowledge
paper: https://arxiv.org/pdf/2502.14739
configpath: opencompass/configs/datasets/supergpqa

View File

@ -34,6 +34,23 @@ problem,answer
## Configuration
### Using LLM for Evaluation via Command Line
Some datasets in OpenCompass already ship with LLM judge configurations.
To use them, you need a judge model service: either a hosted API (such as the official OpenAI or DeepSeek API) or a model served locally with tools like LMDeploy, vLLM, or SGLang.
Then set the environment variables for that evaluation service and run the evaluation with the following commands:
```bash
export OC_JUDGE_MODEL=Qwen/Qwen2.5-32B-Instruct
export OC_JUDGE_API_KEY=sk-1234
export OC_JUDGE_API_BASE=http://172.30.56.1:4000/v1
```
Note that OpenCompass reads these three environment variables by default; if you configure the evaluation service through a configuration file instead, the environment variables will not take effect.
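With these variables set, a complete command-line run might look like the sketch below. The model and dataset names are illustrative placeholders; pick any dataset config that already uses `GenericLLMEvaluator` (such as the AIME or BBEH LLM-judge configs added in this PR) and substitute its actual config name.
```bash
# Point OpenCompass at an OpenAI-compatible judge service
export OC_JUDGE_MODEL=Qwen/Qwen2.5-32B-Instruct
export OC_JUDGE_API_KEY=sk-1234
export OC_JUDGE_API_BASE=http://172.30.56.1:4000/v1

# Evaluate a model on a dataset whose config ships with an LLM judge
# (model/dataset names are placeholders, not prescriptions)
python3 run.py --models hf_meta_llama3_8b_instruct --datasets aime2024_llmjudge_gen --debug
```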
### Using LLM for Evaluation via Configuration Files
To set up an LLM judge evaluation, you'll need to configure three main components:
1. Dataset Reader Configuration
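Before the components are described in detail, here is a minimal sketch of what an explicit judge configuration can look like when you use a configuration file instead of environment variables. It assumes the `OpenAISDK` model wrapper from `opencompass.models` and an OpenAI-compatible endpoint; all values shown are placeholders, not recommendations.
```python
from opencompass.models import OpenAISDK

# Hypothetical judge served behind an OpenAI-compatible API.
# When a non-empty judge_cfg like this is passed to GenericLLMEvaluator,
# the OC_JUDGE_* environment variables are ignored.
judge_cfg = dict(
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',              # model name exposed by the service
    key='sk-1234',                                 # API key for the service
    openai_api_base='http://172.30.56.1:4000/v1',  # base URL of the service
    query_per_second=8,
    batch_size=8,
    temperature=0.001,
    max_out_len=2048,
)
```
The dataset configs later in this diff leave `judge_cfg=dict()` empty, which is what lets the `OC_JUDGE_*` environment variables described above take effect at runtime.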

View File

@ -34,7 +34,24 @@ problem,answer
## Configuration
To set up an LLM judge evaluation, you need to configure three main components:
### Using LLM for Evaluation via Command Line
Some datasets in OpenCompass already include LLM judge configurations.
You need a model service (such as the official OpenAI or DeepSeek API) or a model service started locally with tools such as LMDeploy, vLLM, or SGLang.
Then, set the environment variables for the evaluation service with the following commands and evaluate your models:
```bash
export OC_JUDGE_MODEL=Qwen/Qwen2.5-32B-Instruct
export OC_JUDGE_API_KEY=sk-1234
export OC_JUDGE_API_BASE=http://172.30.56.1:4000/v1
```
Note that OpenCompass uses these three environment variables by default; if you configure the evaluation service through a configuration file instead, they will not take effect.
### Using LLM for Evaluation via Configuration Files
To set up an LLM judge evaluation for a dataset, you need to configure three main components:
1. Dataset Reader Configuration

View File

@ -0,0 +1,90 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CustomDataset
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
aime2024_reader_cfg = dict(input_columns=['question'], output_column='answer')
aime2024_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{question}\nRemember to put your final answer within \\boxed{}.',
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
aime2024_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CustomDataset,
            path='opencompass/aime2024',
reader_cfg=aime2024_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
)
)
aime2024_datasets = [
dict(
abbr='aime2024',
type=CustomDataset,
        path='opencompass/aime2024',
reader_cfg=aime2024_reader_cfg,
infer_cfg=aime2024_infer_cfg,
eval_cfg=aime2024_eval_cfg,
)
]

View File

@ -0,0 +1,90 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CustomDataset
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
aime2025_reader_cfg = dict(input_columns=['question'], output_column='answer')
aime2025_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{question}\nRemember to put your final answer within \\boxed{}.',
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
aime2025_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CustomDataset,
path='opencompass/aime2025',
reader_cfg=aime2025_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
aime2025_datasets = [
dict(
type=CustomDataset,
abbr='aime2025',
path='opencompass/aime2025',
reader_cfg=aime2025_reader_cfg,
infer_cfg=aime2025_infer_cfg,
eval_cfg=aime2025_eval_cfg,
)
]

View File

@ -0,0 +1,26 @@
# BBEH
```bash
python3 run.py --models hf_internlm2_7b --datasets bbeh_gen --debug
python3 run.py --models hf_meta_llama3_8b_instruct --datasets bbeh_gen --debug
```
## Models
| model | score |
|:-----------------------------------------:|------:|
| Meta-Llama-3-8B-Instruct-LMDeploy-API | 10.93 |
### Details
| model | boolean_expressions | disambiguation_qa | geometric_shapes | hyperbaton | movie_recommendation | nycc | shuffled_objects | boardgame_qa |
|:-----------------------------------------:|--------------------:|------------------:|-----------------:|-----------:|---------------------:|-----:|-----------------:|-------------:|
| Meta-Llama-3-8B-Instruct-LMDeploy-API | 14.00 | 33.33 | 13.50 | 1.00 | 28.00 | 11.00 | 10.00 | 18.50 |
| model | buggy_tables | causal_understanding | dyck_languages | linguini | multistep_arithmetic | object_counting | object_properties | sarc_triples |
|:-----------------------------------------:|-------------:|---------------------:|---------------:|---------:|---------------------:|----------------:|------------------:|-------------:|
| Meta-Llama-3-8B-Instruct-LMDeploy-API | 0.00 | 42.50 | 3.50 | 2.00 | 0.00 | 0.00 | 1.00 | 17.00 |
| model | spatial_reasoning | sportqa | temporal_sequence | time_arithmetic | web_of_lies | word_sorting | zebra_puzzles |
|:-----------------------------------------:|------------------:|-------:|-----------------:|----------------:|------------:|-------------:|--------------:|
| Meta-Llama-3-8B-Instruct-LMDeploy-API | 4.00 | 5.00 | 2.00 | 3.00 | 7.50 | 2.00 | 3.50 |

View File

@ -0,0 +1,93 @@
import os
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import BBEHDataset, BBEHEvaluator, bbeh_mcq_postprocess, BBEHEvaluator_mcq
bbeh_reader_cfg = dict(input_columns=['input'], output_column='target')
bbeh_multiple_choice_sets = [
'bbeh_boolean_expressions',
'bbeh_disambiguation_qa',
'bbeh_geometric_shapes',
'bbeh_hyperbaton',
'bbeh_movie_recommendation',
'bbeh_nycc',
'bbeh_shuffled_objects',
]
bbeh_free_form_sets = [
'bbeh_boardgame_qa',
'bbeh_buggy_tables',
'bbeh_causal_understanding',
'bbeh_dyck_languages',
'bbeh_linguini',
'bbeh_multistep_arithmetic',
'bbeh_object_counting',
'bbeh_object_properties',
'bbeh_sarc_triples',
'bbeh_spatial_reasoning',
'bbeh_sportqa',
'bbeh_temporal_sequence',
'bbeh_time_arithmetic',
'bbeh_web_of_lies',
'bbeh_word_sorting',
'bbeh_zebra_puzzles',
]
bbeh_datasets = []
for _name in bbeh_multiple_choice_sets:
bbeh_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt=
f"Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\"without any modification, and provide the answer directly, with no formatting, no bolding, and no markup. For instance: \"The answer is: 42\" or \"The answer is: yes\". If the question is multiple choice with a single correct answer, the final answer must only be the letter corresponding to the correct answer. For example, \"The answer is: (a)\"\n\nQ: {{input}}\nA: "
)
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=8192))
bbeh_eval_cfg = dict(
evaluator=dict(type=BBEHEvaluator_mcq),
pred_role='BOT',
pred_postprocessor=dict(type=bbeh_mcq_postprocess),
dataset_postprocessor=dict(type=bbeh_mcq_postprocess))
bbeh_datasets.append(
dict(
type=BBEHDataset,
path='opencompass/bbeh',
name=_name,
abbr=_name,
reader_cfg=bbeh_reader_cfg,
infer_cfg=bbeh_infer_cfg.copy(),
eval_cfg=bbeh_eval_cfg.copy()))
for _name in bbeh_free_form_sets:
bbeh_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt=
f"Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\"without any modification, and provide the answer directly, with no formatting, no bolding, and no markup. For instance: \"The answer is: 42\" or \"The answer is: yes\". If the question is multiple choice with a single correct answer, the final answer must only be the letter corresponding to the correct answer. For example, \"The answer is: (a)\"\n\nQ: {{input}}\nA: "
)
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=8192))
    bbeh_eval_cfg = dict(
        evaluator=dict(type=BBEHEvaluator),
        pred_role='BOT',
        pred_postprocessor=dict(type=bbeh_mcq_postprocess),
        dataset_postprocessor=dict(type=bbeh_mcq_postprocess))
bbeh_datasets.append(
dict(
type=BBEHDataset,
path='opencompass/bbeh',
name=_name,
abbr=_name,
reader_cfg=bbeh_reader_cfg,
infer_cfg=bbeh_infer_cfg.copy(),
eval_cfg=bbeh_eval_cfg.copy()))

View File

@ -0,0 +1,126 @@
import os
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import (
BBEHDataset,
generic_llmjudge_postprocess,
)
from opencompass.evaluator import GenericLLMEvaluator
bbeh_reader_cfg = dict(input_columns=['input'], output_column='target')
bbeh_multiple_choice_sets = [
'bbeh_boolean_expressions',
'bbeh_disambiguation_qa',
'bbeh_geometric_shapes',
'bbeh_hyperbaton',
'bbeh_movie_recommendation',
'bbeh_nycc',
'bbeh_shuffled_objects',
]
bbeh_free_form_sets = [
'bbeh_boardgame_qa',
'bbeh_buggy_tables',
'bbeh_causal_understanding',
'bbeh_dyck_languages',
'bbeh_linguini',
'bbeh_multistep_arithmetic',
'bbeh_object_counting',
'bbeh_object_properties',
'bbeh_sarc_triples',
'bbeh_spatial_reasoning',
'bbeh_sportqa',
'bbeh_temporal_sequence',
'bbeh_time_arithmetic',
'bbeh_web_of_lies',
'bbeh_word_sorting',
'bbeh_zebra_puzzles',
]
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{input}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
bbeh_datasets = []
for _name in bbeh_multiple_choice_sets + bbeh_free_form_sets:
bbeh_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt=f"Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\"without any modification, and provide the answer directly, with no formatting, no bolding, and no markup. For instance: \"The answer is: 42\" or \"The answer is: yes\". If the question is multiple choice with a single correct answer, the final answer must only be the letter corresponding to the correct answer. For example, \"The answer is: (a)\"\n\nQ: {{input}}\nA: ",
)
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
bbeh_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=BBEHDataset,
path='opencompass/bbeh',
name=_name,
abbr=_name,
reader_cfg=bbeh_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
)
bbeh_datasets.append(
dict(
type=BBEHDataset,
path='opencompass/bbeh',
name=_name,
abbr=_name,
reader_cfg=bbeh_reader_cfg,
infer_cfg=bbeh_infer_cfg,
eval_cfg=bbeh_eval_cfg,
)
)

View File

@ -0,0 +1,185 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import CMMLUDataset
from opencompass.utils.text_postprocessors import match_answer_pattern
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
cmmlu_subject_mapping = {
'agronomy': '农学',
'anatomy': '解剖学',
'ancient_chinese': '古汉语',
'arts': '艺术学',
'astronomy': '天文学',
'business_ethics': '商业伦理',
'chinese_civil_service_exam': '中国公务员考试',
'chinese_driving_rule': '中国驾驶规则',
'chinese_food_culture': '中国饮食文化',
'chinese_foreign_policy': '中国外交政策',
'chinese_history': '中国历史',
'chinese_literature': '中国文学',
'chinese_teacher_qualification': '中国教师资格',
'clinical_knowledge': '临床知识',
'college_actuarial_science': '大学精算学',
'college_education': '大学教育学',
'college_engineering_hydrology': '大学工程水文学',
'college_law': '大学法律',
'college_mathematics': '大学数学',
'college_medical_statistics': '大学医学统计',
'college_medicine': '大学医学',
'computer_science': '计算机科学',
'computer_security': '计算机安全',
'conceptual_physics': '概念物理学',
'construction_project_management': '建设工程管理',
'economics': '经济学',
'education': '教育学',
'electrical_engineering': '电气工程',
'elementary_chinese': '小学语文',
'elementary_commonsense': '小学常识',
'elementary_information_and_technology': '小学信息技术',
'elementary_mathematics': '初等数学',
'ethnology': '民族学',
'food_science': '食品科学',
'genetics': '遗传学',
'global_facts': '全球事实',
'high_school_biology': '高中生物',
'high_school_chemistry': '高中化学',
'high_school_geography': '高中地理',
'high_school_mathematics': '高中数学',
'high_school_physics': '高中物理学',
'high_school_politics': '高中政治',
'human_sexuality': '人类性行为',
'international_law': '国际法学',
'journalism': '新闻学',
'jurisprudence': '法理学',
'legal_and_moral_basis': '法律与道德基础',
'logical': '逻辑学',
'machine_learning': '机器学习',
'management': '管理学',
'marketing': '市场营销',
'marxist_theory': '马克思主义理论',
'modern_chinese': '现代汉语',
'nutrition': '营养学',
'philosophy': '哲学',
'professional_accounting': '专业会计',
'professional_law': '专业法学',
'professional_medicine': '专业医学',
'professional_psychology': '专业心理学',
'public_relations': '公共关系',
'security_study': '安全研究',
'sociology': '社会学',
'sports_science': '体育学',
'traditional_chinese_medicine': '中医中药',
'virology': '病毒学',
'world_history': '世界历史',
'world_religions': '世界宗教',
}
QUERY_TEMPLATE = """
你回答的最后一行**必须**是以下格式 '答案: $选项' (不带引号), 其中选项是ABCD之一.
{question}
A) {A}
B) {B}
C) {C}
D) {D}
""".strip()
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n {question}\n A) {A}\n B) {B}\n C) {C}\n D) {D}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
cmmlu_all_sets = list(cmmlu_subject_mapping.keys())
cmmlu_datasets = []
for _name in cmmlu_all_sets:
_ch_name = cmmlu_subject_mapping[_name]
prompt_prefix = f'请回答以下关于{_ch_name}的单项选择题, '
cmmlu_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=prompt_prefix + QUERY_TEMPLATE),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
cmmlu_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CMMLUDataset,
path='opencompass/cmmlu',
name=_name,
reader_cfg=dict(
input_columns=['question', 'A', 'B', 'C', 'D'],
output_column='answer',
train_split='dev',
test_split='test',
),
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
)
cmmlu_datasets.append(
dict(
type=CMMLUDataset,
path='opencompass/cmmlu',
name=_name,
abbr=f'cmmlu-{_name}',
reader_cfg=dict(
input_columns=['question', 'A', 'B', 'C', 'D'],
output_column='answer',
train_split='dev',
test_split='test',
),
infer_cfg=cmmlu_infer_cfg,
eval_cfg=cmmlu_eval_cfg,
mode='singlescore',
)
)
del _name, _ch_name

View File

@ -0,0 +1,89 @@
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import DropOpenAIDataset
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
with read_base():
from .drop_examples import drop_examples # noqa: F401, F403
drop_reader_cfg = dict(
input_columns=['prompt'],
output_column='answers',
train_split='validation',
test_split='validation',
)
template = f'You will be asked to read a passage and answer a question. Some examples of passages and Q&A are provided below.\n\n{drop_examples}\n\n# Your Task\n\n---\n{{prompt}}\n\nThink step by step, then write a line of the form "Answer: $ANSWER" at the end of your response.'
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {prompt}\n \n<Original Question End>\n\n
<Gold Target Begin>: \n{answers}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
drop_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[dict(role='HUMAN', prompt=template)]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
drop_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=DropOpenAIDataset,
path='data/drop_simple_eval/dev.jsonl',
reader_cfg=drop_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
)
drop_datasets = [
dict(
abbr='drop',
type=DropOpenAIDataset,
path='data/drop_simple_eval/dev.jsonl',
reader_cfg=drop_reader_cfg,
infer_cfg=drop_infer_cfg,
eval_cfg=drop_eval_cfg,
)
]

View File

@ -0,0 +1,97 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccwithDetailsEvaluator
from opencompass.datasets import HellaswagDatasetwithICE
from opencompass.utils.text_postprocessors import first_option_postprocess
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
hellaswag_reader_cfg = dict(
input_columns=['ctx', 'A', 'B', 'C', 'D'],
output_column='label',
train_split='train',
test_split='val',
)
align_prompt = """Continue the following text without adding any additional information or formatting:
{ctx}
A) {A}
B) {B}
C) {C}
D) {D}
What is the right option?"""
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {ctx}\n A) {A}\n B) {B}\n C) {C}\n D) {D}\n<Original Question End>\n\n
<Gold Target Begin>: \n{label}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
hellaswag_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=align_prompt),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
hellaswag_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=HellaswagDatasetwithICE,
path='opencompass/hellaswag_ice',
reader_cfg=hellaswag_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
hellaswag_datasets = [
dict(
abbr='hellaswag',
type=HellaswagDatasetwithICE,
path='opencompass/hellaswag_ice',
reader_cfg=hellaswag_reader_cfg,
infer_cfg=hellaswag_infer_cfg,
eval_cfg=hellaswag_eval_cfg,
)
]

View File

@ -0,0 +1,111 @@
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import MMLUDataset
from opencompass.utils.text_postprocessors import match_answer_pattern
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
with read_base():
from .mmlu_all_sets import mmlu_all_sets
# None of the MMLU datasets on Hugging Face are parsed correctly, so we use our own dataset reader
# Please download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar
QUERY_TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of ABCD.
{input}
A) {A}
B) {B}
C) {C}
D) {D}
""".strip()
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {input}\n A) {A}\n B) {B}\n C) {C}\n D) {D}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
mmlu_reader_cfg = dict(
input_columns=['input', 'A', 'B', 'C', 'D'],
output_column='target',
train_split='dev',
)
mmlu_datasets = []
for name in mmlu_all_sets:
mmlu_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=QUERY_TEMPLATE),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
mmlu_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=MMLUDataset,
path='opencompass/mmlu',
name=name,
reader_cfg=mmlu_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
)
mmlu_datasets.append(
dict(
abbr=f'lukaemon_mmlu_{name}',
type=MMLUDataset,
path='opencompass/mmlu',
name=name,
reader_cfg=mmlu_reader_cfg,
infer_cfg=mmlu_infer_cfg,
eval_cfg=mmlu_eval_cfg,
mode='singlescore',
)
)

View File

@ -0,0 +1,106 @@
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import MMLUProDataset, generic_llmjudge_postprocess
with read_base():
from .mmlu_pro_categories import categories
QUERY_TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of Options(e.g. one of ABCDEFGHIJKLMNOP). Think step by step before answering.
Question:\n
{question}
Options:\n
{options_str}
""".strip()
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {question}\n {options_str} \n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
mmlu_pro_datasets = []
for category in categories:
mmlu_pro_reader_cfg = dict(
input_columns=['question', 'cot_content', 'options_str'],
output_column='answer',
train_split='validation',
test_split='test',
)
mmlu_pro_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=QUERY_TEMPLATE),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
mmlu_pro_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=MMLUProDataset,
path='opencompass/mmlu_pro',
category=category,
reader_cfg=mmlu_pro_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
mmlu_pro_datasets.append(
dict(
abbr=f'mmlu_pro_{category.replace(" ", "_")}',
type=MMLUProDataset,
path='opencompass/mmlu_pro',
category=category,
reader_cfg=mmlu_pro_reader_cfg,
infer_cfg=mmlu_pro_infer_cfg,
eval_cfg=mmlu_pro_eval_cfg,
)
)

View File

@ -0,0 +1,56 @@
# Select the 10 most popular programming languages from MultiPL-E to compose the test set.
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MultiplEDataset, MultiplEEvaluator
_TOP_TEN_LANGUAGE_ = ['cpp', 'cs', 'go', 'java', 'rb', 'js', 'php', 'r', 'rs', 'sh']
multiple_reader_cfg = dict(input_columns=['language', 'prompt'], output_column='tests')
multiple_infer_cfg = dict(
prompt_template=dict(type=PromptTemplate, template='Based on the provided {language} code snippet, complete the subsequent content. The initial part of the completed code must match the provided code snippet exactly:\n{prompt}'),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
multiple_eval_cfg = {
lang: dict(
evaluator=dict(
type=MultiplEEvaluator,
language=lang,
ip_address='https://opencompass-multiple-evaluator.hf.space',
),
pred_role='BOT',
) for lang in _TOP_TEN_LANGUAGE_
}
multiple_datasets = [
dict(
type=MultiplEDataset,
abbr=f'humaneval-multiple-{lang}',
language=lang,
num_repeats=1,
path='opencompass/multipl_e',
tag='humaneval',
reader_cfg=multiple_reader_cfg,
infer_cfg=multiple_infer_cfg,
eval_cfg=multiple_eval_cfg[lang],
) for lang in _TOP_TEN_LANGUAGE_
]
multiple_datasets += [
dict(
type=MultiplEDataset,
abbr=f'mbpp-multiple-{lang}',
language=lang,
num_repeats=1,
path='opencompass/multipl_e',
tag='mbpp',
reader_cfg=multiple_reader_cfg,
infer_cfg=multiple_infer_cfg,
eval_cfg=multiple_eval_cfg[lang],
) for lang in _TOP_TEN_LANGUAGE_
]

View File

@ -0,0 +1,131 @@
from opencompass.datasets import MusrDataset, generic_llmjudge_postprocess
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.openicl import PromptTemplate, ZeroRetriever, GenInferencer
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {system_prompt}\n{prompt}\n<Original Question End>\n\n
<Gold Target Begin>: \n{gold_answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
# Common configuration components
reader_cfg = dict(
input_columns=[
'context',
'question_text',
'question',
'answer',
'choices',
'choices_str',
'intermediate_trees',
'intermediate_data',
'prompt',
'system_prompt',
'gold_answer',
'scidx',
'self_consistency_n',
'ablation_name',
],
output_column='gold_answer',
)
infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt='{system_prompt}',
)
],
round=[
dict(role='HUMAN', prompt='{prompt}'),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Dataset configurations
DATASET_CONFIGS = {
'murder_mysteries': {
'abbr': 'musr_murder_mysteries',
'name': 'murder_mysteries',
'path': 'opencompass/musr',
},
'object_placements': {
'abbr': 'musr_object_placements',
'name': 'object_placements',
'path': 'opencompass/musr',
},
'team_allocation': {
'abbr': 'musr_team_allocation',
'name': 'team_allocation',
'path': 'opencompass/musr',
},
}
# Create dataset configurations
musr_datasets = []
for config in DATASET_CONFIGS.values():
dataset = dict(
abbr=config['abbr'],
type=MusrDataset,
path=config['path'],
name=config['name'],
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=MusrDataset,
path=config['path'],
name=config['name'],
reader_cfg=reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
),
)
musr_datasets.append(dataset)

View File

@ -0,0 +1,57 @@
from opencompass.datasets.supergpqa.supergpqa import (
SuperGPQADataset,
SuperGPQAEvaluator,
)
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
# Reader configuration
reader_cfg = dict(
input_columns=[
'question',
'options',
'discipline',
'field',
'subfield',
'difficulty',
'infer_prompt',
'prompt_mode',
],
output_column='answer_letter',
)
# Inference configuration
infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{infer_prompt}',
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
eval_cfg = dict(
evaluator=dict(type=SuperGPQAEvaluator),
pred_role='BOT',
)
supergpqa_dataset = dict(
type=SuperGPQADataset,
abbr='supergpqa',
path='m-a-p/SuperGPQA',
prompt_mode='zero-shot',
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=eval_cfg,
)
supergpqa_datasets = [supergpqa_dataset]

View File

@ -0,0 +1,103 @@
from opencompass.datasets.supergpqa.supergpqa import (
SuperGPQADataset,
)
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {infer_prompt}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer_letter}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
# Reader configuration
reader_cfg = dict(
input_columns=[
'question',
'options',
'discipline',
'field',
'subfield',
'difficulty',
'infer_prompt',
'prompt_mode',
],
output_column='answer_letter',
)
# Inference configuration
infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{infer_prompt}',
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=SuperGPQADataset,
path='m-a-p/SuperGPQA',
prompt_mode='zero-shot',
reader_cfg=reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
supergpqa_dataset = dict(
type=SuperGPQADataset,
abbr='supergpqa',
path='m-a-p/SuperGPQA',
prompt_mode='zero-shot',
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=eval_cfg,
)
supergpqa_datasets = [supergpqa_dataset]

View File

@ -9,4 +9,4 @@ models = [
batch_size=8,
run_cfg=dict(num_gpus=2),
)
]
]

View File

@ -16,7 +16,17 @@ math_categories = [
'OE_TO_maths_zh_CEE', # OpenEnded - TextOnly - maths - CEE
]
physics_categories = [
'OE_TO_physics_en_COMP', # OpenEnded - TextOnly - physics - COMP
'OE_TO_physics_zh_CEE' # OpenEnded - TextOnly - physics - CEE
]
OlympiadBenchMath_summary_groups = [
{'name': 'OlympiadBenchMath', 'subsets': ['OlympiadBench_' + c.replace(' ', '_') for c in math_categories]},
]
OlympiadBenchPhysics_summary_groups = [
{'name': 'OlympiadBenchPhysics', 'subsets': ['OlympiadBench_' + c.replace(' ', '_') for c in physics_categories]},
]

View File

@ -0,0 +1,13 @@
bbeh_summary_groups = []
# bbeh
_bbeh = [
'bbeh_boolean_expressions', 'bbeh_disambiguation_qa', 'bbeh_geometric_shapes', 'bbeh_hyperbaton',
'bbeh_movie_recommendation', 'bbeh_nycc', 'bbeh_shuffled_objects', 'bbeh_boardgame_qa',
'bbeh_buggy_tables', 'bbeh_causal_understanding', 'bbeh_dyck_languages', 'bbeh_linguini',
'bbeh_multistep_arithmetic', 'bbeh_object_counting', 'bbeh_object_properties', 'bbeh_sarc_triples',
'bbeh_spatial_reasoning', 'bbeh_sportqa', 'bbeh_temporal_sequence', 'bbeh_time_arithmetic',
'bbeh_web_of_lies', 'bbeh_word_sorting', 'bbeh_zebra_puzzles'
]
bbeh_summary_groups.append({'name': 'bbeh', 'subsets': _bbeh, 'metric':'naive_average'})
bbeh_summary_groups.append({'name': 'bbeh', 'subsets': _bbeh, 'metric':'harmonic_mean'})
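# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# A summarizer config can surface either aggregate by pairing the group name
# with the metric it wants; the layout below follows the usual OpenCompass
# summarizer keys and is shown only as an assumption-labelled example.
summarizer = dict(
    summary_groups=bbeh_summary_groups,
    dataset_abbrs=[
        ['bbeh', 'naive_average'],
        ['bbeh', 'harmonic_mean'],
    ],
)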

View File

@ -9,6 +9,7 @@ from .arc import * # noqa: F401, F403
from .arc_prize_public_evaluation import * # noqa: F401, F403
from .ax import * # noqa: F401, F403
from .babilong import * # noqa: F401, F403
from .bbeh import * # noqa: F401, F403
from .bbh import * # noqa: F401, F403
from .bigcodebench import * # noqa: F401, F403
from .boolq import * # noqa: F401, F403
@ -97,6 +98,7 @@ from .mmlu_cf import * # noqa: F401, F403
from .mmlu_pro import * # noqa: F401, F403
from .MMLUArabic import * # noqa: F401, F403
from .mmmlu import * # noqa: F401, F403
from .multipl_e import * # noqa: F401, F403
from .multirc import * # noqa: F401, F403
from .musr import * # noqa: F401, F403
from .narrativeqa import * # noqa: F401, F403
@ -127,6 +129,7 @@ from .strategyqa import * # noqa: F401, F403
from .subjective import * # noqa: F401, F403
from .summedits import * # noqa: F401, F403
from .summscreen import * # noqa: F401, F403
from .supergpqa import * # noqa: F401, F403
from .svamp import * # noqa: F401, F403
from .tabmwp import * # noqa: F401, F403
from .taco import * # noqa: F401, F403

View File

@ -0,0 +1,149 @@
import json
import os.path as osp
import re
from os import environ
from datasets import Dataset
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import (ICL_EVALUATORS, LOAD_DATASET,
TEXT_POSTPROCESSORS)
from opencompass.utils import get_data_path
from .base import BaseDataset
@LOAD_DATASET.register_module()
class BBEHDataset(BaseDataset):
@staticmethod
def load(path: str, name: str):
path = get_data_path(path)
if environ.get('DATASET_SOURCE') == 'ModelScope':
from modelscope import MsDataset
dataset = MsDataset.load(path, subset_name=name, split='test')
else:
with open(osp.join(path, f'{name}/task.json'), 'r') as f:
data = json.load(f)['examples']
dataset = Dataset.from_list(data)
return dataset
@TEXT_POSTPROCESSORS.register_module('bbeh_freeform')
def bbeh_freeform_postprocess(text: str) -> str:
# Extract answer using specified prefixes
prefixes = [
'The answer is: ', 'The answer is ', 'The final answer is: ',
'The final answer is '
]
answer = text
for prefix in prefixes:
if prefix in text:
answer = text.split(prefix)[-1]
break
# Remove formatting markup
if '\\boxed' in answer:
answer = re.sub(r'\\boxed{(.*?)}', r'\1', answer) # latex box
if '\\text' in answer:
answer = re.sub(r'\\text(?:tt)?{(.*?)}', r'\1', answer) # text/texttt
if '**' in answer:
answer = re.sub(r'\*\*(.*?)\*\*', r'\1', answer) # bold
# Take first line and clean
if '\n' in answer:
answer = answer.split('\n')[0].strip()
return answer.strip().lower()
@TEXT_POSTPROCESSORS.register_module('bbeh_mcq')
def bbeh_mcq_postprocess(text: str) -> str:
# Extract answer using specified prefixes
prefixes = [
'The answer is: ', 'The answer is ', 'The final answer is: ',
'The final answer is '
]
answer = text
for prefix in prefixes:
if prefix in text:
answer = text.split(prefix)[-1]
break
# Remove parentheses if present
answer = answer.strip('()')
# Take first line and clean
if '\n' in answer:
answer = answer.split('\n')[0].strip()
return answer.strip().lower()
@ICL_EVALUATORS.register_module()
class BBEHEvaluator(BaseEvaluator):
def score(self, predictions, references):
if len(predictions) != len(references):
return {
'error': 'predictions and references have different length'
}
processed_preds = [bbeh_freeform_postprocess(p) for p in predictions]
# References are already in correct format
processed_refs = [r.lower() for r in references]
details = []
correct_count = 0
for pred, ref in zip(processed_preds, processed_refs):
correct = False
# Rule 1: Exact match
if pred == ref:
correct = True
# Rule 2: Match after removing quotes/brackets
elif pred == ref.strip("'\"()[]"):
correct = True
# Rule 4: Comma-separated answers
elif ',' in ref:
norm_pred = re.sub(r'\s*,\s*', ',', pred)
norm_ref = re.sub(r'\s*,\s*', ',', ref)
if norm_pred == norm_ref:
correct = True
details.append({'pred': pred, 'answer': ref, 'correct': correct})
correct_count += int(correct)
score = (correct_count / len(predictions)) * 100
return {'score': score, 'details': details}
@ICL_EVALUATORS.register_module()
class BBEHEvaluator_mcq(BaseEvaluator):
def score(self, predictions, references):
if len(predictions) != len(references):
return {
'error': 'predictions and references have different length'
}
processed_preds = [bbeh_mcq_postprocess(p) for p in predictions]
# References are already in correct format
processed_refs = [r.lower().strip('()') for r in references]
details = []
correct_count = 0
for pred, ref in zip(processed_preds, processed_refs):
correct = False
# Rule 1: Exact match
if pred == ref:
correct = True
details.append({'pred': pred, 'answer': ref, 'correct': correct})
correct_count += int(correct)
score = (correct_count / len(predictions)) * 100
return {'score': score, 'details': details}
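# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# Toy check of the post-processing and scoring above; the inputs are made up
# and simply exercise the prefix/markup-stripping rules.
if __name__ == '__main__':
    demo_preds = ['The answer is: **42**', 'The final answer is \\boxed{no}']
    demo_refs = ['42', 'yes']
    print(BBEHEvaluator().score(demo_preds, demo_refs))  # expected score: 50.0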

View File

@ -183,6 +183,33 @@ class CustomDataset(BaseDataset):
return Dataset.from_list(data)
@LOAD_DATASET.register_module()
class CodeCustomDataset(BaseDataset):
@staticmethod
def load(path, file_name=None, local_mode=False, num_repeats=1, **kwargs):
path = get_data_path(path, local_mode=local_mode)
if file_name is not None:
path = os.path.join(path, file_name)
data = []
if path.endswith('.jsonl'):
with open(path, 'r', encoding='utf-8') as f:
for line in f:
data.extend(
[json.loads(line.strip()) for _ in range(num_repeats)])
elif path.endswith('.csv'):
with open(path, 'r', encoding='utf-8-sig') as f:
reader = csv.reader(f)
header = next(reader)
for row in reader:
data.extend(
[dict(zip(header, row)) for _ in range(num_repeats)])
else:
raise ValueError(f'Unsupported file format: {path}')
return Dataset.from_list(data)
class CircularCustomDataset(CustomDataset, metaclass=CircularDatasetMeta):
dataset_class = CustomDataset
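# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# num_repeats in CodeCustomDataset.load() duplicates every record in place: a
# 2-line JSONL loaded with num_repeats=3 yields a 6-row Dataset in which each
# original record appears 3 consecutive times, which is what lets downstream
# pass@k evaluators slice predictions[i * num_repeats:(i + 1) * num_repeats].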

View File

@ -53,7 +53,7 @@ def compute_metrics_from_results(results, k_list=[1, 5]):
k: dict(zip(task_ids, v))
for k, v in detail_pass_at_k.items()
}
pass_at_k['detail'] = detail_metrics
pass_at_k['details'] = detail_metrics
return pass_at_k

View File

@ -0,0 +1,103 @@
import json
import os.path as osp
from datasets import Dataset
from opencompass.openicl.icl_evaluator.code_evaluator import CodeEvaluator
from opencompass.registry import LOAD_DATASET
from opencompass.utils import get_data_path
from .base import BaseDataset
# currently supporting languages
_HUMANEVAL_LANGUAGE_ = [
'adb', 'clj', 'cpp', 'cs', 'd', 'dart', 'elixir', 'go', 'hs', 'java', 'jl',
'js', 'lua', 'ml', 'php', 'pl', 'py', 'r', 'rb', 'rkt', 'rs', 'scala',
'sh', 'swift', 'ts'
]
_MBPP_LANGUAGE_ = [
'adb', 'clj', 'cpp', 'cs', 'd', 'elixir', 'go', 'hs', 'java', 'jl', 'js',
'lua', 'ml', 'php', 'pl', 'py', 'r', 'rb', 'rkt', 'rs', 'scala', 'sh',
'swift', 'ts'
]
@LOAD_DATASET.register_module()
class MultiplEDataset(BaseDataset):
@staticmethod
def load(path: str,
language: str,
num_repeats: int = 1,
tag: str = 'humaneval',
local_mode: bool = False):
"""Load dataset for pass k mode.
Args:
path(str): The path to the dataset.
language(str): The language of the dataset.
num_repeats(int): Number of repetition for this dataset to get.
tag(str): The tag of the dataset.
local_mode(bool): Whether to load the dataset in local mode.
Returns:
Dataset: A PyTorch dataset.
"""
path = get_data_path(path, local_mode=local_mode)
assert tag in ['humaneval',
'mbpp'], 'tag must be in ["humaneval", "mbpp"]'
if tag == 'humaneval':
assert language in _HUMANEVAL_LANGUAGE_, (
f'language must be in {_HUMANEVAL_LANGUAGE_}')
else:
assert language in _MBPP_LANGUAGE_, (
f'language must be in {_MBPP_LANGUAGE_}')
file_path = osp.join(path, f'{tag}-{language}.jsonl')
dataset = []
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
dataset.extend(
[json.loads(line.strip()) for _ in range(num_repeats)])
return Dataset.from_list(dataset)
class MultiplEEvaluator(CodeEvaluator):
def _stop_at_stop_token(self, decoded_string, stop_tokens):
"""Produces the prefix of decoded_string that ends at the first
occurrence of a stop_token.
WARNING: the decoded_string *must not* include the prompt,
which may have stop tokens itself.
Args:
decoded_string: A string generated by the model.
stop_tokens: A list of strings, where each string is a stop token.
Returns:
The decoded_string, truncated at the first occurrence of a stop
token.
"""
min_stop_index = len(decoded_string)
for stop_token in stop_tokens:
stop_index = decoded_string.find(stop_token)
if stop_index != -1 and stop_index < min_stop_index:
min_stop_index = stop_index
return decoded_string[:min_stop_index]
def _process_completions(self, test_case, completions):
"""Process completions with a test case.
Args:
test_case: A test case.
completions: A list of completions.
Returns:
A list of processed completions.
"""
processed_completions = []
for comp in completions:
comp = self._extract_code(comp)
post_comp = self._remove_prefix(test_case['prompt'], comp)
post_comp = self._stop_at_stop_token(post_comp,
test_case['stop_tokens'])
processed_completions.append(post_comp)
return processed_completions
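# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# The truncation above keeps everything before the earliest stop token. A tiny
# self-contained illustration of the same rule (the evaluator itself is not
# constructed here because CodeEvaluator.__init__ opens a service connection):
def _demo_stop_at_stop_token(decoded_string, stop_tokens):
    min_stop_index = len(decoded_string)
    for stop_token in stop_tokens:
        stop_index = decoded_string.find(stop_token)
        if stop_index != -1 and stop_index < min_stop_index:
            min_stop_index = stop_index
    return decoded_string[:min_stop_index]


if __name__ == '__main__':
    # '    return x\n' is kept; everything from '\ndef' onwards is dropped.
    assert _demo_stop_at_stop_token('    return x\n\ndef g():\n    pass',
                                    ['\ndef', '\nclass']) == '    return x\n'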

View File

@ -0,0 +1,182 @@
import os
from datasets import Dataset, load_dataset
from opencompass.datasets.supergpqa.supergpqa_eval import (
extract_option_content, extract_option_labels)
from opencompass.datasets.supergpqa.supergpqa_utils import load_yaml
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import ICL_EVALUATORS, LOAD_DATASET
from ..base import BaseDataset
def _parse(item, template, prompt_mode):
prompt_format = [
item['question'] + '\n' + '\n'.join([
f'{chr(65+i)}) {option}'
for i, option in enumerate(item['options'])
])
]
item['infer_prompt'] = template['prompt_format'][0].format(*prompt_format)
item['prompt_mode'] = prompt_mode
return item
@LOAD_DATASET.register_module()
class SuperGPQADataset(BaseDataset):
@staticmethod
def load(path: str, prompt_mode: str, **kwargs):
dataset = load_dataset(path, split='train')
# get prompt template
template_path = None
if prompt_mode == 'zero-shot':
template_path = os.path.join(
os.path.dirname(__file__),
'supergpqa_dataset_config/prompt/zero-shot.yaml',
)
elif prompt_mode == 'five-shot':
template_path = os.path.join(
os.path.dirname(__file__),
'supergpqa_dataset_config/prompt/five-shot.yaml',
)
try:
template = load_yaml(template_path)
except FileNotFoundError:
print(f'[ERROR] Missing prompt template: {template_path}')
return Dataset.from_list([])
dataset = dataset.map(lambda item: _parse(item, template, prompt_mode))
return dataset
@ICL_EVALUATORS.register_module()
class SuperGPQAEvaluator(BaseEvaluator):
def __init__(self):
super().__init__()
def score(self, predictions, references, test_set):
mode = test_set[0]['prompt_mode']
acc = 0
count = 0
err = 0
miss = 0
acc_difficulty = {'hard': 0, 'middle': 0, 'easy': 0}
count_difficulty = {'hard': 0, 'middle': 0, 'easy': 0}
stats = {'discipline': {}, 'field': {}, 'subfield': {}}
details = []
for i, sample in enumerate(test_set):
sample['pred'] = prediction = predictions[i]
gold = references[i]
if mode == 'zero-shot':
predict = extract_option_labels(prediction, 'ABCDEFGHIJ')
if predict is None:
predict = extract_option_content(prediction,
sample['options'])
predict = (chr(sample['options'].index(predict) +
65) if predict else None)
sample['extracted_answer'] = predict
elif mode == 'five-shot':
response = prediction.split('Question:')[0]
predict = extract_option_labels(response, 'ABCDEFGHIJ')
if predict is None:
predict = extract_option_content(response,
sample['options'])
predict = (chr(sample['options'].index(predict) +
65) if predict else None)
if predict is None:
predict = extract_option_labels(prediction, 'ABCDEFGHIJ')
if predict is None:
predict = extract_option_content(
prediction, sample['options'])
predict = (chr(sample['options'].index(predict) +
65) if predict else None)
sample['extracted_answer'] = predict
discipline = sample.get('discipline', 'unknown')
field = sample.get('field', 'unknown')
subfield = sample.get('subfield', 'unknown')
difficulty = sample.get('difficulty', 'unknown')
for level, key in [
('discipline', discipline),
# ('field', f"{discipline}/{field}"),
# ('subfield', f"{discipline}/{field}/{subfield}"),
]:
if key not in stats[level]:
stats[level][key] = {
'correct': 0,
'total': 0,
'miss': 0,
'error': 0,
'discipline': discipline,
'field': field,
'subfield': subfield,
'difficulty': {
'easy': {
'correct': 0,
'total': 0
},
'middle': {
'correct': 0,
'total': 0
},
'hard': {
'correct': 0,
'total': 0
},
},
}
stats[level][key]['total'] += 1
stats[level][key]['difficulty'][difficulty]['total'] += 1
answer_letter = sample['answer_letter']
assert answer_letter == gold
if predict and answer_letter == predict:
acc += 1
acc_difficulty[difficulty] += 1
sample['status'] = 'correct'
stats[level][key]['correct'] += 1
stats[level][key]['difficulty'][difficulty]['correct'] += 1
elif predict is None or predict == '':
miss += 1
sample['status'] = 'miss'
stats[level][key]['miss'] += 1
elif predict == 'error':
err += 1
sample['status'] = 'error'
stats[level][key]['error'] += 1
else:
sample['status'] = 'incorrect'
count += 1
count_difficulty[difficulty] += 1
details.append({
'pred': sample['pred'],
'answer': sample['answer'],
'parsed_answer': sample['extracted_answer'],
'correct': sample['status'] == 'correct',
})
return {
'accuracy':
acc / count if count > 0 else 0,
'error_rate':
err / count if count > 0 else 0,
'miss_rate':
miss / count if count > 0 else 0,
'hard_accuracy':
(acc_difficulty['hard'] /
count_difficulty['hard'] if count_difficulty['hard'] > 0 else 0),
'middle_accuracy':
(acc_difficulty['middle'] / count_difficulty['middle']
if count_difficulty['middle'] > 0 else 0),
'easy_accuracy':
(acc_difficulty['easy'] /
count_difficulty['easy'] if count_difficulty['easy'] > 0 else 0),
'details':
details,
}

View File

@ -0,0 +1,17 @@
response_key: 'response'
error_key: 'error'
id_key:
- 'uuid'
prompt_key: 'prompt'
history_key: 'history'
status_key: 'status'
save_prompt: True
max_tokens: 4096
temperature: 0.0
max_rounds: 30
BoN: 32

View File

@ -0,0 +1,17 @@
response_key: 'response'
error_key: 'error'
id_key:
- 'uuid'
prompt_key: 'prompt'
history_key: 'history'
status_key: 'status'
save_prompt: True
max_tokens: 32768
temperature: 0.0
max_rounds: 30
BoN: 32

View File

@ -0,0 +1,88 @@
import yaml
class ConfigWrapper:
def __init__(self, config_path):
self._config = {}
with open(config_path, 'r') as file:
self._config = yaml.safe_load(file)
for key, value in self._config.items():
setattr(self, key, value)
def __setattr__(self, key, value):
if key.startswith('_'):
super().__setattr__(key, value)
else:
self._config[key] = value
super().__setattr__(key, value)
def __getattr__(self, key):
if key in self._config:
return self._config[key]
raise AttributeError(
f"'ConfigWrapper' object has no attribute '{key}'")
def get_id(self, data):
if isinstance(self._config.get('id_key'), str):
return data.get(self._config.get('id_key'), None)
elif isinstance(self._config.get('id_key'), list):
return '_'.join([
str(data[key]) for key in self._config.get('id_key')
if key in data
])
def print_all_keys(self):
print('config keys:')
for key, value in self._config.items():
print(f' - {key}: {value}')
config_wrapper = None
def initialize_config(config_path):
global config_wrapper
config_wrapper = ConfigWrapper(config_path)
def get_config_wrapper():
global config_wrapper
if config_wrapper is None:
raise RuntimeError(
'ConfigWrapper not initialized. Call initialize_config first.')
return config_wrapper
if __name__ == '__main__':
config_path = 'config/config.yaml'
initialize_config(config_path)
data = {
'idx':
'50',
'step':
21,
'question':
'Ciphertext: "17,156,4,54,213,17,23,84,228,54,281"\n\n'
'Please provide the decrypted answer, encapsulated in double square'
' brackets. For example, the format should be: [[decrypted answer]].',
'answer':
'[[P]]',
'category':
'Decryption',
'rule_id':
'23',
'input':
'Ciphertext: "17,156,4,54,213,17,23,84,228,54,281"',
'steps_num':
23,
'description':
'For a number c=228 in the ciphertext:\n'
'Calculate z = c^e mod n. Here ^ means multiplication.\nz is 80.'
'\nBased on the decimal number represented by z, use the ascii '
'code to find the corresponding letter as the plaintext letter p.'
'\nPlease give the letter p in [[...]] format.\n',
'atom':
80,
}
print(config_wrapper.get_id(data))

View File

@ -0,0 +1,91 @@
prompt_format:
- |
Answer the following multiple choice question. There is only one correct answer. The last line of your response should be in the format 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
Question:
A refracting telescope consists of two converging lenses separated by 100 cm. The eye-piece lens has a focal length of 20 cm. The angular magnification of the telescope is
A) 10
B) 40
C) 6
D) 25
E) 15
F) 50
G) 30
H) 4
I) 5
J) 20
Answer: Let's think step by step. In a refracting telescope, if both lenses are converging, the focus of both lenses must be between the two lenses, and thus the focal lengths of the two lenses must add up to their separation. Since the focal length of one lens is 20 cm, the focal length of the other must be 80 cm. The magnification is the ratio of these two focal lengths, or 4.
Answer: H.
Question:
Say the pupil of your eye has a diameter of 5 mm and you have a telescope with an aperture of 50 cm. How much more light can the telescope gather than your eye?
A) 1000 times more
B) 50 times more
C) 5000 times more
D) 500 times more
E) 10000 times more
F) 20000 times more
G) 2000 times more
H) 100 times more
I) 10 times more
J) N/A
Answer: Let's think step by step. The amount of light a telescope can gather compared to the human eye is proportional to the area of its apertures. The area of a circle is given by the formula $A = \pi \left(\frac{{D}}{{2}}\right)^2$, where $D$ is the diameter. Therefore, the relative light-gathering power is calculated as:
\[
\frac{{\left(\frac{{50 \text{{ cm}}}}{{2}}\right)^2}}{{\left(\frac{{5 \text{{ mm}}}}{{2}}\right)^2}} = \frac{{\left(\frac{{50 \text{{ cm}}}}{{0.1 \text{{ cm}}}}\right)^2}}{{\left(\frac{{5 \text{{ mm}}}}{{0.1 \text{{ cm}}}}\right)^2}} = \frac{{500^2}}{{5^2}} = 10000.
\]
Answer: E.
Question:
Where do most short-period comets come from and how do we know?
A) The Kuiper belt; short period comets tend to be in the plane of the solar system like the Kuiper belt.
B) The asteroid belt; short period comets tend to come from random directions indicating a spherical distribution of comets called the asteroid belt.
C) The asteroid belt; short period comets tend to be in the plane of the solar system just like the asteroid belt.
D) The Oort cloud; short period comets have orbital periods similar to asteroids like Vesta and are found in the plane of the solar system just like the Oort cloud.
E) The Oort Cloud; short period comets tend to come from random directions indicating a spherical distribution of comets called the Oort Cloud.
F) The Oort cloud; short period comets tend to be in the plane of the solar system just like the Oort cloud.
G) The asteroid belt; short period comets have orbital periods similar to asteroids like Vesta and are found in the plane of the solar system just like the asteroid belt.
Answer: Let's think step by step. Most short-period comets originate from the Kuiper belt. This is deduced from the observation that these comets tend to follow orbits that lie in the plane of the solar system, similar to the distribution of objects in the Kuiper belt itself. Thus, the alignment of these cometary orbits with the ecliptic plane points to their Kuiper belt origin.
Answer: A.
Question:
Colors in a soap bubble result from light
A) dispersion
B) deflection
C) refraction
D) reflection
E) interference
F) converted to a different frequency
G) polarization
H) absorption
I) diffraction
J) transmission
Answer: Let's think step by step. The colorful patterns observed in a soap bubble are caused by the phenomenon of light interference. This occurs when light waves bounce between the two surfaces of the soap film, combining constructively or destructively based on their phase differences and the varying thickness of the film. These interactions result in vibrant color patterns due to variations in the intensity of different wavelengths of light.
Answer: E.
Question:
A microwave oven is connected to an outlet, 120 V, and draws a current of 2 amps. At what rate is energy being used by the microwave oven?
A) 240 W
B) 120 W
C) 10 W
D) 480 W
E) 360 W
F) 200 W
G) 30 W
H) 150 W
I) 60 W
J) 300 W
Answer: Let's think step by step. The rate of energy usage, known as power, in an electrical circuit is calculated by the product of voltage and current. For a microwave oven connected to a 120 V outlet and drawing a current of 2 amps, the power consumption can be calculated as follows:
\[
\text{{Power}} = \text{{Voltage}} \times \text{{Current}} = 120 \, \text{{V}} \times 2 \, \text{{A}} = 240 \, \text{{W}}.
\]
Therefore, the microwave oven uses energy at a rate of 240 watts.
Answer: A.
Question:
{}
Answer: Let's think step by step.

View File

@ -0,0 +1,23 @@
initial_prompt_0:
- |
Answer the following multiple choice question. There is only one correct answer. The last line of your response should be in the format 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
{}
initial_prompt_1:
- |
You are a helpful assistant. Answer the given multiple-choice question. Only one option is correct. The last line of your response should be in the format 'The correct answer is: $LETTER', where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
{}
initial_prompt_2:
- |
Select the correct answer for the following multiple-choice question. There is only one valid choice. The last line of your response should be in the format 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
{}
initial_prompt_3:
- |
Review the following multiple-choice question and choose the one correct answer. Ensure that your response concludes with a line exactly formatted as 'The correct answer is: $LETTER', where LETTER represents one of A, B, C, D, E, F, G, H, I, or J.
{}

View File

@ -0,0 +1,5 @@
prompt_format:
- |
Answer the following multiple choice question about {}. There is only one correct answer. The last line of your response should be in the format 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
{}

View File

@ -0,0 +1,5 @@
prompt_format:
- |
Answer the following multiple choice question. There is only one correct answer. The last line of your response should be in the format 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
{}

View File

@ -0,0 +1,96 @@
# flake8: noqa: W605
import re
import timeout_decorator
@timeout_decorator.timeout(5) # 5 seconds timeout
def safe_regex_search(pattern, text, flags=0):
try:
return re.search(pattern, text, flags)
except timeout_decorator.TimeoutError:
print(f'Regex match timeout: pattern={pattern}, text={text[:100]}...')
return None
except Exception as e:
print(f'Regex match error: {str(e)}')
return None
def extract_option_labels(text, options='ABCDEFGHIJ'):
if not isinstance(text, str) or not isinstance(options, str):
return 'error'
text = text.rstrip()
last_line = text.split('\n')[-1]
option_str = ''.join([chr(65 + i) for i in range(len(options))
]) if options else 'ABCDEFGHIJ'
patterns = [
# e.g. "The final answer to this question is: A."
# "The best option is $\boxed{B}:"
# "The correct answer is (C)."
f'[Tt]he\s+(?:\w+\s+)?(?:answer|option)(?:\w+\s+)?\s+is?:?\s*(?:[\*\$\\{{(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*([{option_str}])(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
# e.g. "ANSWER: A"
# "Answer: $\boxed{B}."
# "ANSWER: (C):"
f'(?i:Answer)[\*\s]*:\s*(?:[\*\$\\{{(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*([{option_str}])(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
# e.g. "A"
# "$\boxed{B}$"
# "(C)."
# "[D]:"
f'^[^\w\r\n]*(?:[\*\$\\{{(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*([{option_str}])(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
]
for pattern in patterns:
match = safe_regex_search(pattern, last_line, re.IGNORECASE)
if match:
return match.group(1)
for pattern in patterns:
match = safe_regex_search(pattern, text, re.IGNORECASE)
if match:
return match.group(1)
return None
def extract_option_content(text, options_content=None):
if not isinstance(text, str) or not isinstance(options_content, list):
return 'error'
escaped_options_content = [
re.escape(option_content) for option_content in options_content
]
escaped_options_content_str = '|'.join(escaped_options_content)
text = text.rstrip()
last_line = text.split('\n')[-1]
patterns = [
f'[Tt]he\s+(?:\w+\s+)?(?:answer|option)(?:\w+\s+)?\s+is:?\s*(?:[\*\$\\{{\(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*({escaped_options_content_str})(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
f'(?i:Answer)\s*(?:[\*\$\\{{\(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*({escaped_options_content_str})(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
f'^[^\w\r\n]*(?:[\*\$\\{{\(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*({escaped_options_content_str})(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
]
for pattern in patterns:
match = safe_regex_search(pattern, last_line)
if match:
if match.group(1) in escaped_options_content:
return options_content[escaped_options_content.index(
match.group(1))]
else:
return match.group(1)
for pattern in patterns:
match = safe_regex_search(pattern, text)
if match:
if match.group(1) in escaped_options_content:
return options_content[escaped_options_content.index(
match.group(1))]
else:
return match.group(1)
return None
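# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# Toy illustration of the two extraction helpers above; the inputs are made up
# and the expected outputs follow from the regexes as written.
if __name__ == '__main__':
    print(extract_option_labels('Some reasoning...\nAnswer: C'))  # -> 'C'
    print(extract_option_content('The correct answer is Paris.',
                                 ['London', 'Paris', 'Rome', 'Berlin']))
    # -> 'Paris'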

View File

@ -0,0 +1,693 @@
import json
import os
import re
import sympy as sp
import yaml
from sympy.parsing.latex import parse_latex
def load_yaml(yaml_path):
"""Load a YAML file."""
if not os.path.exists(yaml_path):
raise FileNotFoundError(f'YAML file not found: {yaml_path}')
with open(yaml_path, 'r', encoding='utf-8') as file:
return yaml.safe_load(file)
def load_json_or_jsonl(file_path):
"""Load data from a JSON or JSONL file."""
if not os.path.exists(file_path):
return None
with open(file_path, 'r', encoding='utf-8') as file:
if file_path.endswith('.json'):
return json.load(file)
elif file_path.endswith('.jsonl'):
return [json.loads(line) for line in file]
return None
def find_file(base_path, sub_path, extensions=('json', 'jsonl')):
"""Find the first available file with given extensions."""
for ext in extensions:
file_path = os.path.join(base_path, f'{sub_path}.{ext}')
if os.path.exists(file_path):
return file_path
return None
def load_json_or_jsonl_with_idx(data_path, split='', idx=None):
base_path = os.path.join(data_path, split)
if os.path.exists(f'{base_path}.json'):
file_path = f'{base_path}.json'
elif os.path.exists(f'{base_path}.jsonl'):
file_path = f'{base_path}.jsonl'
elif base_path.endswith('.json') or base_path.endswith('.jsonl'):
file_path = base_path
else:
raise FileNotFoundError('No JSON or JSONL file found.')
with open(file_path, 'r', encoding='utf-8') as file:
if file_path.endswith('.json'):
data = json.load(file)
elif file_path.endswith('.jsonl'):
data = [json.loads(line) for line in file]
if idx is not None:
try:
return next(item for item in data if item.get('idx') == idx)
except StopIteration:
raise ValueError(f'No entry found for idx {idx}')
else:
return data
def load_split_data(base_path, split_name):
"""Load the rule and sample data for a specific split."""
split_path = os.path.join(base_path, split_name)
rule_path = find_file(split_path, 'rule')
sample_path = find_file(split_path, 'sample')
rules = load_json_or_jsonl(rule_path) if rule_path else []
samples = load_json_or_jsonl(sample_path) if sample_path else []
return {'rules': rules, 'samples': samples}
def process_mixed_data(base_path, mode):
"""Load and process data for the 'mixed' split and specific mode."""
mixed_path = os.path.join(base_path, 'mixed')
file_path = find_file(mixed_path, mode)
if not file_path:
print(f'[WARNING] Missing file for mixed mode: {mode}')
return []
data = load_json_or_jsonl(file_path)
template_path = os.path.join(base_path, 'config/prompt/mixed.yaml')
template = load_yaml(template_path)
processed = []
for item in data:
rules = '\n'.join(item.get('rule_list', []))
questions = '\n'.join(item.get('question_list', []))
item['prompt'] = template['prompt_format'][0].format(rules, questions)
processed.append(item)
return processed
class ConfigWrapper:
def __init__(self, config_path):
self._config = {}
with open(config_path, 'r') as file:
self._config = yaml.safe_load(file)
for key, value in self._config.items():
setattr(self, key, value)
def __setattr__(self, key, value):
if key.startswith('_'):
super().__setattr__(key, value)
else:
self._config[key] = value
super().__setattr__(key, value)
def __getattr__(self, key):
if key in self._config:
return self._config[key]
raise AttributeError(
f"'ConfigWrapper' object has no attribute '{key}'")
def get_id(self, data):
if isinstance(self._config.get('id_key'), str):
return data.get(self._config.get('id_key'), None)
elif isinstance(self._config.get('id_key'), list):
return '_'.join([
str(data[key]) for key in self._config.get('id_key')
if key in data
])
def print_all_keys(self):
print('config keys:')
for key, value in self._config.items():
print(f' - {key}: {value}')
config_wrapper = None
def initialize_config(config_path):
global config_wrapper
config_wrapper = ConfigWrapper(config_path)
def get_config_wrapper():
global config_wrapper
if config_wrapper is None:
raise RuntimeError(
'ConfigWrapper not initialized. Call initialize_config first.')
return config_wrapper
if __name__ == '__main__':
config_path = 'config/config.yaml'
initialize_config(config_path)
data = {
'idx':
'50',
'step':
21,
'question':
('Ciphertext: "17,156,4,54,213,17,23,84,228,54,281"\n\n'
'Please provide the decrypted answer, encapsulated in double '
'square brackets. '
'For example, the format should be: [[decrypted answer]].'),
'answer':
'[[P]]',
'category':
'Decryption',
'rule_id':
'23',
'input':
'Ciphertext: "17,156,4,54,213,17,23,84,228,54,281"',
'steps_num':
23,
'description':
('For a number c=228 in the ciphertext:\n'
'Calculate z = c^e mod n. Here ^ means multiplication.\n'
'z is 80.\nBased on the decimal number represented by z, '
'use the ascii code to find the corresponding letter '
'as the plaintext letter p.\n'
'Please give the letter p in [[...]] format.\n'),
'atom':
80
}
print(config_wrapper.get_id(data))
def read_yaml(config='default'):
if os.path.exists(f'config/prompt/{config}.yaml'):
yaml_file = f'config/prompt/{config}.yaml'
else:
yaml_file = config
with open(yaml_file, 'r') as yaml_file:
return yaml.safe_load(yaml_file)
def write_jsonl_lines(file, data):
config_wrapper = get_config_wrapper()
if config_wrapper.save_prompt:
json.dump(data, file, ensure_ascii=False)
else:
data.pop(config_wrapper.prompt_key)
json.dump(data, file, ensure_ascii=False)
file.write('\n')
file.flush()
def print_info(info):
print('-' * 100)
print('[INFO] model_name:', info['model_name'])
print('[INFO] splits:', info['splits'])
print('[INFO] modes:', info['modes'])
print('[INFO] output_dir:', info['output_dir'])
print('[INFO] Infer Limit:',
'No limit' if info['infer_limit'] is None else info['infer_limit'])
print('[INFO] Number of Workers:', info['num_workers'])
print('[INFO] Batch Size:', info['batch_size'])
print('[INFO] Use Accel:', info['use_accel'])
print('-' * 100)
def read_json_or_jsonl(data_path, split='', mapping_key=None):
base_path = os.path.join(data_path, split)
if os.path.exists(f'{base_path}.json'):
file_path = f'{base_path}.json'
elif os.path.exists(f'{base_path}.jsonl'):
file_path = f'{base_path}.jsonl'
elif base_path.endswith('.json') or base_path.endswith('.jsonl'):
file_path = base_path
else:
raise FileNotFoundError('No JSON or JSONL file found.')
with open(file_path, 'r') as file:
if file_path.endswith('.json'):
data = json.load(file)
elif file_path.endswith('.jsonl'):
data = [json.loads(line) for line in file]
if mapping_key:
return {
item[mapping_key]: item
for item in data if mapping_key in item
}
else:
return data
def read_json_or_jsonl_with_idx(data_path, split='', idx=None):
base_path = os.path.join(data_path, split)
if os.path.exists(f'{base_path}.json'):
file_path = f'{base_path}.json'
elif os.path.exists(f'{base_path}.jsonl'):
file_path = f'{base_path}.jsonl'
elif base_path.endswith('.json') or base_path.endswith('.jsonl'):
file_path = base_path
else:
raise FileNotFoundError('No JSON or JSONL file found.')
with open(file_path, 'r', encoding='utf-8') as file:
if file_path.endswith('.json'):
data = json.load(file)
elif file_path.endswith('.jsonl'):
data = [json.loads(line) for line in file]
if idx is not None:
try:
return next(item for item in data if item.get('idx') == idx)
except StopIteration:
raise ValueError(f'No entry found for idx {idx}')
else:
return data
idx_ranges = [
[18],
[73, 74, 77],
[94],
[115, 116, 117],
[121, 122, 123, 125],
[131, 132, 134, 135, 136],
[141, 143, 149],
list(range(145, 148)),
list(range(151, 157)),
[160, 161, 162],
[164, 165, 166],
[170],
[206, 209],
list(range(211, 216)),
[217, 218],
]
def clean_json_string(json_str):
json_str = re.sub(r'[\x00-\x1F\x7F]', '', json_str)
return json_str
def is_in_idx_ranges(idx, idx_ranges):
for range_list in idx_ranges:
if int(idx) in range_list:
return True
return False
def extract_json(text):
matches = re.findall(r'{.*}', text, re.DOTALL)
if matches:
json_str = matches[-1]
json_str = clean_json_string(json_str)
try:
data = json.loads(json_str)
return data
except json.JSONDecodeError as e:
print(f'Error decoding JSON: {e}')
return 'NULL'
return 'NULL'
def extract_all_responses_from_json(response_json):
results = []
for key, value in response_json.items():
results.append(str(value))
return results
def clean_latex(latex_expr):
if '=' in latex_expr:
latex_expr = latex_expr.rsplit('=', 1)[1]
latex_expr = re.sub(r'\\[()\[\]]', '', latex_expr)
latex_expr = re.sub(r'\\text\{.*?\}', '', latex_expr)
latex_expr = re.sub(r'\\(left|right|displaystyle)', '', latex_expr)
latex_expr = latex_expr.replace('\\\\', '\\')
return latex_expr
def extract_text_from_brackets(text, clean_level='basic'):
matches = re.findall(r'\[\[\s*(.*?)\s*\]\]', text, re.DOTALL)
if not matches:
matches = re.findall(r'\$\\boxed\{(.*?)\}\$', text, re.DOTALL)
if not matches:
matches = re.findall(r'\[\s*(.*?)\s*\]', text, re.DOTALL)
if matches:
match_str = matches[0].strip()
if clean_level == 'clean':
match_str = match_str.replace('"', '').replace('\n', '').replace(
' ', '').replace('[', '').replace(']', '')
elif clean_level == 'logic':
match_str = match_str.replace('"', '').replace('\n', '').replace(
' ', '').replace('.', '')
elif clean_level == 'math':
match_str = match_str.replace('"', '').replace('\n', '').replace(
'[', '').replace(']', '').replace('$', '')
return f'{clean_latex(match_str)}'
return f'[[{match_str}]]'
return 'NULL'
def extract_inner_text_from_brackets(text):
if not isinstance(text, str):
print(f'text type: {type(text)}, text value: {text}')
return 'NULL'
match = re.search(r'\[\[(.*?)\]\]', text, re.DOTALL)
return match.group(1) if match else 'NULL'
def extract_numbers(str):
numbers = re.findall(r'\d+', str)
numbers = list(map(int, numbers))
return numbers
def extract_and_sort_inequalities(latex_expr):
pattern = r'(≥|≤)\s*([-]?\d+\.?\d*)'
matches = re.findall(pattern, latex_expr)
extracted_inequalities = [''.join(match) for match in matches]
sorted_inequalities = sorted(extracted_inequalities)
return sorted_inequalities
def rule5_normalize_content(content):
parts = [part for part in content.split(';')]
sorted_parts = sorted(parts)
return sorted_parts
def normalize_string(s):
s = re.sub(r'[^0-9]', '', s)
pairs = s.split(',')
pairs.sort()
return pairs
def remove_commas_and_spaces(s):
return re.sub(r'[,\s\[\]]+', '', s)
def remove_non_alphanumeric(s):
return re.sub(r'\W+', '', s)
def contains_or(answer):
return 'or' in answer
def compare_multi_results(response, answer):
try:
response_text = extract_text_from_brackets(response, 'clean')
response_text = re.sub(r'\\text\{or\}', 'or', response_text)
if response_text == 'NULL':
return False
answer = extract_text_from_brackets(answer, 'clean')
response_split = response_text.strip('[[]]').split('or')
answer_split = answer.strip('[[]]').split('or')
response_sorted = sorted([x.strip() for x in response_split])
answer_sorted = sorted([x.strip() for x in answer_split])
return response_sorted == answer_sorted
except Exception as e:
print(f'Error during comparison: {e}')
return False
def split_or_expression(expression):
return [part.strip() for part in expression.split('or')]
def compare_math_expressions(response, answer):
response_text = extract_text_from_brackets(response, 'math')
answer_text = extract_text_from_brackets(answer, 'math')
if response_text == 'NULL':
return False
if contains_or(answer_text):
response_parts = split_or_expression(response_text)
answer_parts = split_or_expression(answer_text)
try:
response_exprs = {
sp.simplify(parse_latex(part))
for part in response_parts
}
answer_exprs = {
sp.simplify(parse_latex(part))
for part in answer_parts
}
return response_exprs == answer_exprs
except Exception as e:
print(f'Error during simplification or parsing: {e}')
return response_text == answer_text
else:
try:
response_expr = sp.simplify(parse_latex(response_text))
answer_expr = sp.simplify(parse_latex(answer_text))
return response_expr == answer_expr
except Exception as e:
print(f'Error during simplification or parsing: {e}')
return response_text == answer_text
def method_equal(response_text, answer):
return response_text == answer
def method_1(response_text, answer):
cleaned_string = re.sub(r'[^A-Za-z]', '', response_text)
cleaned_string = cleaned_string.lower()
answer = re.sub(r'[^A-Za-z]', '', answer)
answer = answer.lower()
return cleaned_string == answer
def method_2(response_text, answer):
cleaned_string = re.sub(r'[^A-Za-z]', '', response_text)
cleaned_string = cleaned_string.lower()
answer = answer.split(',')
return cleaned_string in answer
def method_3(response_text, answer):
response_text = response_text.lower()
pairs1 = re.split(r'\W+', response_text)
pairs2 = answer.split(' ')
pairs1 = [word for word in pairs1 if word]
pairs1.sort()
pairs2.sort()
return pairs1 == pairs2
def method_4(response_text, answer):
cleaned_string = re.sub(r'[^A-Za-z]', '', response_text)
cleaned_string = cleaned_string.lower()
return cleaned_string in answer
def method_5(response_text, answer):
response_text = re.sub(r'\s+', '', response_text)
response_text = response_text.split(',')
answer = answer.split(',')
response_text.sort()
answer.sort()
return response_text == answer
def method_9(response_text, answer):
response_text = response_text.replace('×', '*').replace('−', '-')
answer = answer.replace('×', '*').replace('−', '-')
def extract_operators(s):
return re.findall(r'[+\-*/]', s)
response_ops = extract_operators(response_text.split('=')[0])
answer_ops = extract_operators(answer.split('=')[0])
if response_ops != answer_ops:
return False
match = re.search(r'=\s*(-?\d+)', answer)
expected_result = int(match.group(1))
try:
left_side = response_text.split('=')[0]
result = eval(left_side)
except Exception as e:
print(f'Error during evaluation: {e}')
return False
return result == expected_result
def method_10(response_text, answer):
response_text = response_text.replace('×', '*').replace('−', '-')
response_text = response_text.split('=')[0]
answer = answer.split('\n')[0].split('=')[0]
response_ops = sorted(remove_non_alphanumeric(response_text))
answer_ops = sorted(remove_non_alphanumeric(answer))
if response_ops != answer_ops:
return False
try:
result = eval(response_text)
except Exception as e:
print(f'Error during evaluation: {e}')
return False
return result == 24
def method_18(response_text, answer):
cleaned_s1 = remove_commas_and_spaces(response_text)
cleaned_s2 = remove_commas_and_spaces(answer)
return cleaned_s1 == cleaned_s2
def method_general(response_text, answer):
cleaned_s1 = remove_non_alphanumeric(response_text)
cleaned_s2 = remove_non_alphanumeric(answer)
return cleaned_s1 == cleaned_s2
question_methods = {
'1': method_1,
'2': method_2,
'3': method_3,
'4': method_4,
'5': method_5,
'9': method_9,
'10': method_10,
'18': method_18,
}
def evaluate_response_vs_answer(response, answer, question_type, rule_id, idx):
if question_type == 'logic' and rule_id == '5':
response_text = extract_text_from_brackets(response, 'logic')
answer_text = extract_text_from_brackets(answer, 'logic')
if response_text is None:
return False
normalized_response = rule5_normalize_content(response_text)
normalized_answer = rule5_normalize_content(answer)
return normalized_response == normalized_answer
elif question_type == 'logic':
response_text = extract_text_from_brackets(response, 'logic')
answer_text = extract_text_from_brackets(answer, 'logic')
return response_text == answer_text
elif question_type == 'operation' and (idx == '178' or idx == '179'):
response_text = extract_text_from_brackets(response, 'clean')
response_text = extract_and_sort_inequalities(response_text)
answer_text = extract_and_sort_inequalities(answer)
# print(response_text, answer_text)
return response_text == answer_text
elif question_type == 'operation' and rule_id == '18':
response_text = extract_text_from_brackets(response, 'clean')
answer = extract_inner_text_from_brackets(answer)
response_text = ''.join(sorted(re.sub(r'\W+', '', response_text)))
answer = ''.join(sorted(re.sub(r'\W+', '', answer)))
return response_text == answer
elif question_type == 'operation' and rule_id in {'23', '24', '25'}:
response_text = extract_text_from_brackets(response, 'clean')
if response_text is None:
return False
response_text = extract_numbers(response_text)
answer_text = extract_numbers(answer)
return response_text == answer_text
elif question_type == 'operation' and is_in_idx_ranges(idx, idx_ranges):
return compare_math_expressions(response, answer)
elif question_type == 'operation' and contains_or(answer):
return compare_multi_results(response, answer)
elif question_type == 'puzzle':
response_text = extract_inner_text_from_brackets(response)
answer = extract_inner_text_from_brackets(answer)
method = question_methods.get(rule_id)
if method:
return method(response_text, answer)
return method_general(response_text, answer)
else:
response_text = extract_text_from_brackets(response, 'clean')
return response_text == answer
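# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# Worked example of the dispatcher above (values are made up): for a 'logic'
# question with a rule other than '5', both sides are reduced by
# extract_text_from_brackets(..., 'logic'), so 'The result is [[ A, B ]]' and
# '[[A,B]]' both normalise to '[[A,B]]' and compare equal, i.e.
#   evaluate_response_vs_answer('The result is [[ A, B ]]', '[[A,B]]',
#                               'logic', '7', '1')  ->  True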
def compute_one_mixed_question_pass_rate(idx,
question_list,
response_json,
base_path=None):
if response_json == 'NULL':
result_dict = {
'idx': idx,
'response': response_json,
'details': None,
'pass_rate': 0,
'is_correct': False
}
return result_dict
response_list = extract_all_responses_from_json(response_json)
correct_num = 0
results = []
for q_idx, question in enumerate(question_list):
category, question_idx = question.rsplit('_', 1)
question_content = load_json_or_jsonl_with_idx(base_path,
os.path.join(
category, 'sample'),
idx=question_idx)
answer = question_content['answer']
if q_idx >= len(response_list):
break
response = response_list[q_idx]
response_text = extract_text_from_brackets(response)
rule_id = question_content['rule_id']
is_correct = evaluate_response_vs_answer(response, answer, category,
rule_id, q_idx)
if is_correct:
correct_num += 1
results.append({
'question': question,
'response_text': response_text,
'answer': answer,
'is_correct': is_correct
})
pass_rate = correct_num / len(question_list)
question_correct = pass_rate == 1.0
result_dict = {
'idx': idx,
'response': response_json,
'details': results,
'pass_rate': pass_rate,
'is_correct': question_correct
}
return result_dict
def evaluate_responses(data, mode, base_path=None):
results = []
# Iterate over the values of the dictionary (numerical keys)
for key, record in data.items():
idx = key # Use the dictionary key as the "idx"
response = record.get('prediction', '')
question_type = record.get('category', '')
response_text = extract_text_from_brackets(response)
answer = record.get('gold', '')
rule_id = record.get('rule_id', '')
is_correct = evaluate_response_vs_answer(response, answer,
question_type, rule_id, idx)
result_dict = {
'idx': idx,
'response': response,
'response_text': response_text,
'answer': answer,
'is_correct': is_correct
}
if question_type == 'counterfactual':
real_life_answer = record.get('real_life_answer', '')
is_real_life = evaluate_response_vs_answer(response,
real_life_answer,
question_type, rule_id,
idx)
result_dict['real_life_answer'] = real_life_answer
result_dict['is_real_life'] = is_real_life
if question_type == 'cipher' and mode == 'subquestions':
result_dict['type'] = record.get('type', '')
results.append(result_dict)
return results

View File

@ -1,3 +1,4 @@
import os
import os.path as osp
from typing import Dict, List, Optional
@ -36,7 +37,11 @@ class GenericLLMEvaluator(BaseEvaluator):
) -> None:
self.logger = get_logger()
self.judge_cfg = judge_cfg
# If judge_cfg is not provided, fall back to the default configuration
if not judge_cfg:
self.judge_cfg = self.default_judge_cfg
else:
self.judge_cfg = judge_cfg
self.output_path = ''
self.prompt_template = ICL_PROMPT_TEMPLATES.build(prompt_template)
@ -141,3 +146,30 @@ class GenericLLMEvaluator(BaseEvaluator):
kwargs = self.dict_postprocessor
proc = DICT_POSTPROCESSORS.get(kwargs.pop('type'))
return proc(output, self.output_path, **kwargs)
@property
def default_judge_cfg(self):
from opencompass.models import OpenAISDK
DEFAULT_JUDGE_CFG = dict(
type=OpenAISDK,
path=os.environ['OC_JUDGE_MODEL'],
key=os.environ['OC_JUDGE_API_KEY'],
openai_api_base=[
os.environ.get('OC_JUDGE_API_BASE',
'https://api.openai.com/v1/')
],
meta_template=dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
], ),
query_per_second=16,
batch_size=1024,
temperature=0.001,
tokenizer_path='gpt-4o-2024-05-13',
verbose=True,
max_out_len=16384,
max_seq_len=49152,
)
return DEFAULT_JUDGE_CFG
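# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# The fallback above reads its judge model from the environment; one minimal
# way to provide those values from Python before launching an eval (the model
# name and key below are placeholder assumptions, not project defaults):
import os

os.environ.setdefault('OC_JUDGE_MODEL', 'gpt-4o-2024-05-13')
os.environ.setdefault('OC_JUDGE_API_KEY', '<your-api-key>')
os.environ.setdefault('OC_JUDGE_API_BASE', 'https://api.openai.com/v1/')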

View File

@ -399,7 +399,7 @@ class OpenAI(BaseAPIModel):
self.logger.info(
f'Successfully load default tiktoken tokenizer: '
f' {default_tokenizer}')
return len(enc.encode(prompt))
return len(enc.encode(prompt, disallowed_special=()))
def _bin_trim(self, prompt: str, num_token: int, mode: str) -> str:
"""Get a suffix of prompt which is no longer than num_token tokens.

View File

@ -12,3 +12,4 @@ from .icl_misc_evaluator import AveragePPLEvaluator # noqa
from .icl_plugin_evaluator import TEvalEvaluator # noqa
from .icl_toxic_evaluator import ToxicEvaluator # noqa
from .lm_evaluator import LMEvaluator # noqa
from .math_evaluator import MATHEvaluator # noqa

View File

@ -0,0 +1,267 @@
# flake8: noqa: E501
import difflib
import os
import re
import tempfile
import time
from typing import Any, Dict, List, Optional, Tuple, Union
from datasets import Dataset
from gradio_client import Client
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import ICL_EVALUATORS
@ICL_EVALUATORS.register_module()
class CodeEvaluator(BaseEvaluator):
"""Evaluator for code generation tasks.
This evaluator sends code to a remote evaluation service to test its
functionality against provided test cases. It handles code extraction,
processing, and result analysis.
"""
def __init__(self,
language: str,
ip_address: str = 'localhost',
retry: int = 3) -> None:
"""Initialize the CodeEvaluator.
Args:
language (str): Programming language of the code to evaluate.
ip_address (str, optional): IP address of the evaluation service. Defaults to 'localhost'.
retry (int, optional): Number of retry attempts for failed connections. Defaults to 3.
"""
self.language = language
self.retry = retry
self.client = Client(ip_address)
super().__init__()
def _extract_code(self, text: str) -> str:
"""Extract code from markdown-formatted text.
Args:
text (str): Text that may contain code blocks in markdown format.
Returns:
str: Extracted code from the first code block, or the original text if no code blocks are found.
"""
blocks = re.findall(r'```\w*\n(.*?)```', text, re.DOTALL)
if len(blocks) >= 1:
text = blocks[0]
return text
def _code_eval_service(
self, input_data: Union[Dict, List,
str]) -> Tuple[bool, Union[Dict, List, Any]]:
"""Send code to the remote evaluation service using gradio_client and
get the results.
Args:
input_data: Can be one of:
- dict: Dictionary containing code information for a single test case
- list: List of dictionaries for batch evaluation
- str: File path to code file
Returns:
tuple: (succeed, output)
- succeed (bool): Whether the request was successful
- output (dict/list/str): Evaluation results or error message
"""
try:
temp_file_path = None
# Handle file path input
if isinstance(input_data, str):
with tempfile.NamedTemporaryFile(suffix=f'.{self.language}',
delete=False) as temp_file:
temp_file_path = temp_file.name
with open(input_data, 'r') as src_file:
content = src_file.read()
temp_file.write(content.encode())
input_data = temp_file_path
# Send to evaluation service
result = self.client.predict(input_data, api_name='/evaluate')
# Process the result
if isinstance(result, (dict, list)):
return True, result
else:
# Try to parse the result as JSON if it's a string
try:
import json
parsed_result = json.loads(result)
return True, parsed_result
except: # noqa: E722
return True, {'status': 'unknown', 'raw_result': result}
except Exception as e:
return False, str(e)
finally:
# Clean up temporary file if it was created
if temp_file_path and os.path.exists(temp_file_path):
try:
os.unlink(temp_file_path)
except: # noqa: E722
pass
def _remove_prefix(self,
prompt: str,
completion: str,
threshold: float = 0.95) -> str:
"""Determine the truncation point in the completion based on the last
line of the prompt, remove all content before that line in the
completion, and return the completion string after removing the prefix.
This is done to convert chatbot-style inference mode to completion
mode.
Args:
prompt (str): The prompt text.
completion (str): The completion text.
threshold (float): Line similarity threshold.
Returns:
str: The completion string after removing the prefix.
"""
prompt_lines = prompt.splitlines()
completion_lines = completion.splitlines()
if not prompt_lines:
return completion
last_prompt_line = prompt_lines[-1]
cut_index = -1
for i, completion_line in enumerate(completion_lines):
similarity = difflib.SequenceMatcher(None, last_prompt_line,
completion_line).ratio()
if similarity >= threshold:
cut_index = i
break
if cut_index != -1:
return '\n'.join(completion_lines[cut_index + 1:])
else:
return completion
def _process_completions(self, test_case: dict, completions: list) -> list:
"""Process code completion list, which typically involves extracting
code, removing repetitive prefixes caused by chatbot mode, and other
steps to ensure the model-generated code can be compiled successfully.
Args:
test_case (dict): Dictionary with test case information (e.g. name, language, prompt, tests).
completions (list): List of code completions generated by the model.
Returns:
list: Processed code completion list.
"""
processed_completions = []
for comp in completions:
comp = self._extract_code(comp)
post_comp = self._remove_prefix(test_case['prompt'], comp)
processed_completions.append(post_comp)
return processed_completions
def _evaluate(
self, input_data: Union[Dict, List]
) -> Tuple[bool, Optional[Union[Dict, List]], Optional[str]]:
"""Evaluate code with retry mechanism.
Args:
input_data: Can be either:
- dict: Dictionary containing code and test information for a single test case
- list: List of dictionaries for batch evaluation
Returns:
tuple: (success, output, error_message)
- success (bool): Whether the evaluation was successful
- output (dict or list): Evaluation output (if successful)
- error_message (str): Error message (if failed)
"""
num_retry = 0
while num_retry < self.retry:
succeed, output = self._code_eval_service(input_data)
if not succeed:
num_retry += 1
time.sleep(10)
else:
break
if not succeed:
return False, None, f'code eval service connection failed: {output}'
return True, output, None
def score(self, predictions: List, references: List,
test_set: Dataset) -> Dict:
"""Score code generation predictions against references.
Args:
predictions (list): List of model-generated code completions.
references (list): List of reference solutions (not directly used in evaluation).
test_set (Dataset): Dataset containing test cases and other metadata.
Returns:
dict: Evaluation results including:
- accuracy: Percentage of correctly solved problems
- details: Detailed results for each test case
- error: Error message if evaluation failed
"""
if len(predictions) != len(references):
return {
'error':
'predictions and references have different '
f'length. len(predictions): {len(predictions)}, '
f'len(references): {len(references)}'
}
test_set = test_set.to_pandas()
# Use the first column as the unique identifier
test_set_origin = test_set.drop_duplicates(subset=test_set.columns[0])
num_repeats = int(len(test_set) / len(test_set_origin))
# 1. Prepare data for all test cases
all_test_cases = []
for i in range(len(test_set_origin)):
test_case = test_set_origin.iloc[i]
completions = predictions[i * num_repeats:(i + 1) * num_repeats]
# Process code completions
processed_completions = self._process_completions(
test_case, completions)
result_dict = {
'name': test_case['name'],
'language': test_case['language'],
'prompt': test_case['prompt'],
'tests': test_case['tests'],
'processed_completions': processed_completions,
'completions': completions
}
all_test_cases.append(result_dict)
# 2. Send all test cases to the evaluation service
success, outputs, error_message = self._evaluate(all_test_cases)
if not success:
return {'error': error_message}
# 3. Process the returned results
details = []
correct = 0
for output in outputs:
if output.get('status') == 'OK':
output['correct'] = True
correct += 1
else:
output['correct'] = False
details.append(output)
return {
f'pass@{num_repeats}': 100 * correct / len(test_set_origin),
'details': details
}
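# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# The final aggregation in score() reduces to the arithmetic below; numbers are
# made up (3 unique problems, 2 completions each, 2 problems reported 'OK').
if __name__ == '__main__':
    demo_outputs = [{'status': 'OK'}, {'status': 'OK'}, {'status': 'Failed'}]
    demo_repeats = 2
    demo_correct = sum(o['status'] == 'OK' for o in demo_outputs)
    print({f'pass@{demo_repeats}': 100 * demo_correct / len(demo_outputs)})
    # -> {'pass@2': 66.66666666666667}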

View File

@ -1,4 +1,5 @@
"""Base Evaluator."""
from collections import OrderedDict
from copy import deepcopy
from typing import Any, Dict, Iterable, List, Union
@ -77,12 +78,17 @@ class BaseEvaluator:
for metric in all_metrics:
if metric in ['predictions', 'example_abbr']:
continue
g_passk_details[metric] = 100. * np.mean(
g_passk_details[metric] = 100.0 * np.mean(
[detail[metric] for detail in details])
return g_passk_details
def evaluate(self, k: Union[int, List[int]], n: int,
original_dataset: Dataset, **score_kwargs):
def evaluate(
self,
k: Union[int, List[int]],
n: int,
original_dataset: Dataset,
**score_kwargs,
):
real_size = len(original_dataset) // n
all_details = []
all_results = []
@ -146,7 +152,7 @@ class BaseEvaluator:
if can_calculate and n > 1 and k > 1:
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
for _k in ([k] if isinstance(k, int) else k):
for _k in [k] if isinstance(k, int) else k:
for threshold in thresholds:
g_pass = compute_g_pass_at_k(n=n,
c=c,
@ -161,9 +167,31 @@ class BaseEvaluator:
if can_calculate and n > 1 and k > 1:
eval_results.update(self.reduce(eval_details))
# Store eval_details in eval_results
eval_results['details'] = eval_details
return eval_results
# Process details to flatten the predictions
for detail in eval_details:
# Extract all prediction fields and flatten them
flattened_predictions = {}
for pred in detail['predictions']:
for k, v in pred.items():
if k not in flattened_predictions:
flattened_predictions[k] = [v]
else:
flattened_predictions[k].append(v)
# Replace the predictions list with the flattened dictionary
for k, v in flattened_predictions.items():
detail[k] = v
# Remove the original predictions field
detail.pop('predictions')
return eval_results
# If there are no details, return results
return results
def score(self):
raise NotImplementedError("Method hasn't been implemented yet")
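
As a standalone illustration of the prediction-flattening step above, assuming every per-repeat prediction dict shares the same keys (names here are illustrative):

    preds = [{'answer': 'A', 'correct': True},
             {'answer': 'B', 'correct': False}]
    flattened = {}
    for pred in preds:
        for key, value in pred.items():
            flattened.setdefault(key, []).append(value)
    # flattened == {'answer': ['A', 'B'], 'correct': [True, False]}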

View File

@ -1,7 +1,3 @@
from latex2sympy2_extended import NormalizationConfig
from math_verify import (ExprExtractionConfig, LatexExtractionConfig, parse,
verify)
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import ICL_EVALUATORS
@ -10,6 +6,14 @@ from opencompass.registry import ICL_EVALUATORS
class MATHEvaluator(BaseEvaluator):
def score(self, predictions, references):
try:
from latex2sympy2_extended import NormalizationConfig
from math_verify import (ExprExtractionConfig,
LatexExtractionConfig, parse, verify)
except ImportError:
raise ImportError('Failed to import required modules. Please '
'install the necessary packages: '
'pip install math_verify latex2sympy2_extended')
self.is_num_equal(predictions, references)
@ -75,7 +79,7 @@ class MATHEvaluator(BaseEvaluator):
if __name__ == '__main__':
import sympy
from math_verify import parse
test_cases = [
# 1. Basic arithmetic operations
r'Simple fraction: \boxed{\frac{1}{2}}',
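
The same lazy-import pattern in isolation (the helper name is hypothetical), so the optional math backends are only required when the evaluator is actually used:

    def _require_math_backends():
        try:
            from latex2sympy2_extended import NormalizationConfig  # noqa: F401
            from math_verify import parse, verify  # noqa: F401
        except ImportError as exc:
            raise ImportError('Please install the optional packages: '
                              'pip install math_verify '
                              'latex2sympy2_extended') from exc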

View File

@ -256,7 +256,7 @@ class VOLCRunner(BaseRunner):
with open(config_path) as fp:
volc_cfg = yaml.safe_load(fp)
if num_gpus <= 0:
flavor = 'ml.c3i.2xlarge'
flavor = 'ml.r3i.2xlarge'
elif num_gpus == 1:
flavor = 'ml.pni2l.3xlarge'
elif num_gpus == 2:

View File

@ -171,6 +171,8 @@ class DefaultSummarizer:
default_metric = 'sum'
elif sg.get('weights', []):
default_metric = 'weighted_average'
elif sg.get('harmonic_mean', False):
default_metric = 'harmonic_mean'
else:
default_metric = 'naive_average'
@ -186,24 +188,35 @@ class DefaultSummarizer:
eval_modes.append(dataset_eval_mode.get(dataset_abbr, 'unknown'))
else:
group_metrics = list(functools.reduce(lambda a, b: a & b, [set(dataset_metrics[dataset_abbr]) for dataset_abbr in sg['subsets']]))
if need_smart_metric and len(group_metrics) > 1:
for metric in group_metrics:
for dataset_abbr in sg['subsets']:
scores.setdefault(metric, {})[dataset_abbr + '@' + metric] = parsed_results[model_abbr][dataset_abbr][metric]
eval_modes.append(dataset_eval_mode.get(sg['subsets'][0], 'unknown'))
else:
group_metrics = [default_metric]
group_metrics.append(default_metric)
for metric in group_metrics:
for dataset_abbr in sg['subsets']:
metric = dataset_metrics[dataset_abbr][0]
scores.setdefault(default_metric, {})[dataset_abbr + '@' + metric] = parsed_results[model_abbr][dataset_abbr][metric]
eval_modes.append(dataset_eval_mode.get(dataset_abbr, 'unknown'))
if metric == default_metric:
metric_default = dataset_metrics[dataset_abbr][0]
scores.setdefault(default_metric, {})[dataset_abbr + '@' + metric_default] = \
parsed_results[model_abbr][dataset_abbr][metric_default]
eval_modes.append(dataset_eval_mode.get(dataset_abbr, 'unknown'))
else:
scores.setdefault(metric, {})[dataset_abbr + '@' + metric] = \
parsed_results[model_abbr][dataset_abbr][metric]
eval_modes.append(dataset_eval_mode.get(sg['subsets'][0], 'unknown'))
result = {}
for metric in scores:
if default_metric == 'standard_deviation':
avg = sum(scores[metric].values()) / len(scores[metric])
variance = sum((scores[metric][k] - avg) ** 2 for k in scores[metric]) / len(scores[metric])
scores[metric] = result[metric] = math.sqrt(variance)
elif default_metric == 'harmonic_mean':
# Check for non-positive values that would cause issues in harmonic mean
if any(scores[metric][k] <= 0 for k in scores[metric]):
self.logger.warning(f'Non-positive values found when calculating harmonic mean for {sg["name"]}')
                    # Clamp non-positive scores to 1 so the harmonic mean stays defined
numerator = len(scores[metric])
denominator = sum(1 / max(scores[metric][k], 1) for k in scores[metric])
else:
numerator = len(scores[metric])
denominator = sum(1 / scores[metric][k] for k in scores[metric])
scores[metric] = result[metric] = numerator / denominator
else:
if sg.get('weights', []):
# check sg['weights'][k] != 0 in case of scores[metric][k] is NaN
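
A worked sketch of the harmonic-mean aggregation above, clamping non-positive scores to 1 as the code does (the function name is illustrative):

    def harmonic_mean(scores):
        n = len(scores)
        return n / sum(1 / max(s, 1) for s in scores)

    print(harmonic_mean([50.0, 80.0]))  # ~61.5, below the arithmetic mean of 65.0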

View File

@ -263,28 +263,34 @@ class OpenICLEvalTask(BaseTask):
if self.dump_details:
details = result.get('details', None)
try:
result['details'] = self.format_details(
pred_strs,
model_pred_strs,
test_set[self.output_column],
details,
model_details,
pred_dicts,
)
self.logger.warning(
f"result['details'] : {result['details']}"),
result['type'] = result['details'].pop('type', None)
if self.cal_extract_rate:
# Calculate the extraction success rate for prediction
result['extract_rate'] = self.extract_rate(result)
                # Try to format details if they are not provided by the evaluator
                if details is None:
                    self.logger.info(
                        'Details not given by evaluator, trying to format them')
try:
result['details'] = self.format_details(
pred_strs,
model_pred_strs,
test_set[self.output_column],
details,
model_details,
pred_dicts,
)
self.logger.warning(
f"result['details'] : {result['details']}"),
result['type'] = result['details'].pop('type', None)
if self.cal_extract_rate:
# Calculate the extraction success
# rate for prediction
result['extract_rate'] = self.extract_rate(result)
if 'PPL' in str(
self.dataset_cfg.infer_cfg.inferencer.type):
result['correct_bpb'], result['incorrect_bpb'] = (
self.calculate_bpb(pred_dicts))
except Exception as e:
self.logger.warning(f'Skip dumping details due to: {e}.')
if 'PPL' in str(
self.dataset_cfg.infer_cfg.inferencer.type):
result['correct_bpb'], result['incorrect_bpb'] = (
self.calculate_bpb(pred_dicts))
except Exception as e:
self.logger.warning(
f'Skip dumping details due to: {e}.')
else:
result.pop('details', None)

View File

@ -33,6 +33,12 @@ DATASETS_MAPPING = {
"hf_id": "opencompass/bbh",
"local": "./data/BBH/data",
},
# bbeh
"opencompass/bbeh": {
"ms_id": "",
"hf_id": "",
"local": "./data/bbeh/",
},
# C-Eval
"opencompass/ceval-exam": {
"ms_id": "opencompass/ceval-exam",
@ -187,6 +193,12 @@ DATASETS_MAPPING = {
"hf_id": "",
"local": "./data/mmlu_pro",
},
# MultiPL-E
"opencompass/multipl_e": {
"ms_id": "",
"hf_id": "",
"local": "./data/multipl_e",
},
# NQ
"opencompass/natural_question": {
"ms_id": "opencompass/natural_question",
@ -303,6 +315,11 @@ DATASETS_MAPPING = {
"hf_id": "",
"local": "./data/aime.jsonl",
},
"opencompass/aime2025": {
"ms_id": "",
"hf_id": "",
"local": "./data/aime2025/aime2025.jsonl",
},
"opencompass/cmo_fib": {
"ms_id": "",
"hf_id": "",
@ -616,6 +633,11 @@ DATASETS_URL = {
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu_pro.zip",
"md5": "e3200c7380f4cea5f13c768f2815fabb",
},
"multipl_e": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/multipl_e.zip",
"md5": "24462aac7a38a4a62f5c5e89eb614e20",
},
"/Longbench": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/Longbench.zip",
@ -646,11 +668,16 @@ DATASETS_URL = {
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/test_generation.zip",
"md5": "918a6ea2b1eee6f2b1314db3c21cb4c7",
},
"/aime": {
"/aime2024": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip",
"md5": "fbe2d0577fc210962a549f8cea1a00c8",
},
"/aime2025": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime2025.zip",
"md5": "aa18cd5d2e2de246c5397f5eb1e61004",
},
"/cmo": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/cmo.zip",
@ -691,6 +718,10 @@ DATASETS_URL = {
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/korbench.zip",
"md5": "9107597d137e7362eaf7d218ddef7a6d",
},
"/bbeh": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/bbeh.zip",
"md5": "43a3c2d73aee731ac68ac790bc9a358e",
},
"subjective/judgerbench": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/judgerbench.zip",

View File

@ -276,13 +276,15 @@ def change_accelerator(models, accelerator):
if model.get(item) is not None:
acc_model[item] = model[item]
elif accelerator == 'vllm':
model_kwargs = dict(tensor_parallel_size=model['run_cfg']['num_gpus'], max_model_len=model.get('max_seq_len', None))
            model_kwargs.update(model.get('model_kwargs', {}))
logger.info(f'Transforming {model["abbr"]} to {accelerator}')
acc_model = dict(
type=f'{VLLM.__module__}.{VLLM.__name__}',
abbr=model['abbr'].replace('hf', 'vllm') if '-hf' in model['abbr'] else model['abbr'] + '-vllm',
path=model['path'],
model_kwargs=dict(tensor_parallel_size=model['run_cfg']['num_gpus'], max_model_len=model.get('max_seq_len', None)),
model_kwargs=model_kwargs,
max_out_len=model['max_out_len'],
max_seq_len=model.get('max_seq_len', None),
batch_size=model['batch_size'],
@ -296,12 +298,14 @@ def change_accelerator(models, accelerator):
raise ValueError(f'Unsupported accelerator {accelerator} for model type {model["type"]}')
elif model['type'] in [HuggingFacewithChatTemplate, f'{HuggingFacewithChatTemplate.__module__}.{HuggingFacewithChatTemplate.__name__}']:
if accelerator == 'vllm':
model_kwargs = dict(tensor_parallel_size=model['run_cfg']['num_gpus'], max_model_len=model.get('max_seq_len', None))
            model_kwargs.update(model.get('model_kwargs', {}))
mod = VLLMwithChatTemplate
acc_model = dict(
type=f'{mod.__module__}.{mod.__name__}',
abbr=model['abbr'].replace('hf', 'vllm') if '-hf' in model['abbr'] else model['abbr'] + '-vllm',
path=model['path'],
model_kwargs=dict(tensor_parallel_size=model['run_cfg']['num_gpus'], max_model_len=model.get('max_seq_len', None)),
model_kwargs=model_kwargs,
max_seq_len=model.get('max_seq_len', None),
max_out_len=model['max_out_len'],
batch_size=16,
@ -309,6 +313,14 @@ def change_accelerator(models, accelerator):
stop_words=model.get('stop_words', []),
)
elif accelerator == 'lmdeploy':
            if model.get('generation_kwargs') is not None:
                logger.warning('LMDeploy uses do_sample=False by default; set do_sample=True to enable sampling mode')
                gen_config = model['generation_kwargs'].copy()
            else:
                logger.info('OpenCompass uses greedy decoding by default; set generation_kwargs to change this behavior')
                gen_config = dict(top_k=1, temperature=1e-6, top_p=0.9)
mod = TurboMindModelwithChatTemplate
acc_model = dict(
type=f'{mod.__module__}.{mod.__name__}',
@ -320,7 +332,7 @@ def change_accelerator(models, accelerator):
session_len=model.get('max_seq_len', None),
max_new_tokens=model['max_out_len']
),
gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9),
gen_config=gen_config,
max_seq_len=model.get('max_seq_len', None),
max_out_len=model['max_out_len'],
batch_size=16,
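
A hypothetical example of the gen_config selection added above: a model config that sets generation_kwargs keeps its own sampling settings, otherwise greedy decoding is used (the model dict here is illustrative):

    model = {'abbr': 'demo-hf',
             'generation_kwargs': {'do_sample': True, 'top_p': 0.95}}
    gen_config = (model['generation_kwargs'].copy()
                  if model.get('generation_kwargs') is not None
                  else dict(top_k=1, temperature=1e-6, top_p=0.9))
    # gen_config == {'do_sample': True, 'top_p': 0.95}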

View File

@ -12,7 +12,7 @@ faiss_gpu==1.7.2
# IFEval
langdetect
# TheoremQA
latex2sympy2
latex2sympy2==1.9.1
# Lawbench, leval
ltp
# Math