Merge branch 'main' into qwq32b

Linchen Xiao 2025-03-24 11:30:28 +08:00 committed by GitHub
commit d8b056cd77
50 changed files with 3380 additions and 56 deletions

View File

@ -57,6 +57,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2025.03.11\]** We now support evaluation on `SuperGPQA`, a benchmark covering 285 graduate-level disciplines for measuring LLM knowledge. 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series of models; please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model which has enhanced performance on reasoning and knowledge-intensive tasks.

View File

@ -57,6 +57,7 @@
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2025.03.11\]** We now support `SuperGPQA`, a knowledge benchmark covering 285 graduate-level disciplines. Give it a try! 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series of models; please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluation and `MATHEvaluator` for mathematical reasoning assessment. See the [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) docs for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model, which achieves the best performance among models of comparable size on reasoning and knowledge-intensive tasks. Give it a try!

View File

@ -234,6 +234,11 @@
category: Reasoning
paper: https://arxiv.org/pdf/2210.09261
configpath: opencompass/configs/datasets/bbh
- bbeh:
name: BIG-Bench Extra Hard
category: Reasoning
paper: https://arxiv.org/abs/2502.19187
configpath: opencompass/configs/datasets/bbeh
- BoolQ:
name: SuperGLUE / BoolQ
category: Knowledge
@ -524,6 +529,11 @@
category: Understanding
paper: https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf
configpath: opencompass/configs/datasets/SuperGLUE_MultiRC
- multipl_e:
name: MultiPL-E
category: Code
paper: https://arxiv.org/pdf/2210.14868
configpath: opencompass/configs/datasets/multipl_e
- narrativeqa:
name: NarrativeQA
category: Understanding
@ -734,6 +744,8 @@
category: Understanding
paper: https://arxiv.org/pdf/1808.08745
configpath: opencompass/configs/datasets/Xsum
- supergpqa:
name: SuperGPQA
category: Knowledge
paper: https://arxiv.org/pdf/2502.14739
configpath: opencompass/configs/datasets/supergpqa

View File

@ -34,6 +34,23 @@ problem,answer
## Configuration
### Using LLM for Evaluation via Command Line
Some datasets in OpenCompass already ship with LLM judge configurations.
To use them, you need a judge model service: either a hosted API (such as the official OpenAI or DeepSeek API) or a model served locally with tools like LMDeploy, vLLM, or SGLang.
Then set the environment variables for that evaluation service and run the evaluation with the following commands:
```bash
export OC_JUDGE_MODEL=Qwen/Qwen2.5-32B-Instruct
export OC_JUDGE_API_KEY=sk-1234
export OC_JUDGE_API_BASE=http://172.30.56.1:4000/v1
```
Note that OpenCompass reads these three environment variables by default; if you configure the evaluation service through a configuration file instead, the environment variables will not take effect.
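With these variables set, a complete command-line run might look like the sketch below. The model and dataset names are illustrative placeholders; pick any dataset config that already uses `GenericLLMEvaluator` (such as the AIME or BBEH LLM-judge configs added in this PR) and substitute its actual config name.
```bash
# Point OpenCompass at an OpenAI-compatible judge service
export OC_JUDGE_MODEL=Qwen/Qwen2.5-32B-Instruct
export OC_JUDGE_API_KEY=sk-1234
export OC_JUDGE_API_BASE=http://172.30.56.1:4000/v1

# Evaluate a model on a dataset whose config ships with an LLM judge
# (model/dataset names are placeholders, not prescriptions)
python3 run.py --models hf_meta_llama3_8b_instruct --datasets aime2024_llmjudge_gen --debug
```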
### Using LLM for Evaluation via Configuration Files
To set up an LLM judge evaluation, you'll need to configure three main components:
1. Dataset Reader Configuration
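Before the components are described in detail, here is a minimal sketch of what an explicit judge configuration can look like when you use a configuration file instead of environment variables. It assumes the `OpenAISDK` model wrapper from `opencompass.models` and an OpenAI-compatible endpoint; all values shown are placeholders, not recommendations.
```python
from opencompass.models import OpenAISDK

# Hypothetical judge served behind an OpenAI-compatible API.
# When a non-empty judge_cfg like this is passed to GenericLLMEvaluator,
# the OC_JUDGE_* environment variables are ignored.
judge_cfg = dict(
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',              # model name exposed by the service
    key='sk-1234',                                 # API key for the service
    openai_api_base='http://172.30.56.1:4000/v1',  # base URL of the service
    query_per_second=8,
    batch_size=8,
    temperature=0.001,
    max_out_len=2048,
)
```
The dataset configs later in this diff leave `judge_cfg=dict()` empty, which is what lets the `OC_JUDGE_*` environment variables described above take effect at runtime.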

View File

@ -34,7 +34,24 @@ problem,answer
## Configuration
To set up an LLM judge evaluation, you need to configure three main components:
### Using LLM for Evaluation via Command Line
Some datasets in OpenCompass already include LLM judge configurations.
You need a model service (such as the official OpenAI or DeepSeek API) or a model service started locally with tools such as LMDeploy, vLLM, or SGLang.
Then, set the environment variables for the evaluation service with the following commands and evaluate your models:
```bash
export OC_JUDGE_MODEL=Qwen/Qwen2.5-32B-Instruct
export OC_JUDGE_API_KEY=sk-1234
export OC_JUDGE_API_BASE=http://172.30.56.1:4000/v1
```
Note that OpenCompass uses these three environment variables by default; if you configure the evaluation service through a configuration file instead, they will not take effect.
### Using LLM for Evaluation via Configuration Files
To set up an LLM judge evaluation for a dataset, you need to configure three main components:
1. Dataset Reader Configuration

View File

@ -0,0 +1,90 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CustomDataset
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
aime2024_reader_cfg = dict(input_columns=['question'], output_column='answer')
aime2024_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{question}\nRemember to put your final answer within \\boxed{}.',
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
aime2024_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CustomDataset,
            path='opencompass/aime2024',
reader_cfg=aime2024_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
)
)
aime2024_datasets = [
dict(
abbr='aime2024',
type=CustomDataset,
        path='opencompass/aime2024',
reader_cfg=aime2024_reader_cfg,
infer_cfg=aime2024_infer_cfg,
eval_cfg=aime2024_eval_cfg,
)
]

View File

@ -0,0 +1,90 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CustomDataset
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
aime2025_reader_cfg = dict(input_columns=['question'], output_column='answer')
aime2025_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{question}\nRemember to put your final answer within \\boxed{}.',
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
aime2025_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CustomDataset,
path='opencompass/aime2025',
reader_cfg=aime2025_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
aime2025_datasets = [
dict(
type=CustomDataset,
abbr='aime2025',
path='opencompass/aime2025',
reader_cfg=aime2025_reader_cfg,
infer_cfg=aime2025_infer_cfg,
eval_cfg=aime2025_eval_cfg,
)
]

View File

@ -0,0 +1,26 @@
# BBEH
```bash
python3 run.py --models hf_internlm2_7b --datasets bbeh_gen --debug
python3 run.py --models hf_meta_llama3_8b_instruct --datasets bbeh_gen --debug
```
## Models
| model | score |
|:-----------------------------------------:|------:|
| Meta-Llama-3-8B-Instruct-LMDeploy-API | 10.93 |
### Details
| model | boolean_expressions | disambiguation_qa | geometric_shapes | hyperbaton | movie_recommendation | nycc | shuffled_objects | boardgame_qa |
|:-----------------------------------------:|--------------------:|------------------:|-----------------:|-----------:|---------------------:|-----:|-----------------:|-------------:|
| Meta-Llama-3-8B-Instruct-LMDeploy-API | 14.00 | 33.33 | 13.50 | 1.00 | 28.00 | 11.00 | 10.00 | 18.50 |
| model | buggy_tables | causal_understanding | dyck_languages | linguini | multistep_arithmetic | object_counting | object_properties | sarc_triples |
|:-----------------------------------------:|-------------:|---------------------:|---------------:|---------:|---------------------:|----------------:|------------------:|-------------:|
| Meta-Llama-3-8B-Instruct-LMDeploy-API | 0.00 | 42.50 | 3.50 | 2.00 | 0.00 | 0.00 | 1.00 | 17.00 |
| model | spatial_reasoning | sportqa | temporal_sequence | time_arithmetic | web_of_lies | word_sorting | zebra_puzzles |
|:-----------------------------------------:|------------------:|-------:|-----------------:|----------------:|------------:|-------------:|--------------:|
| Meta-Llama-3-8B-Instruct-LMDeploy-API | 4.00 | 5.00 | 2.00 | 3.00 | 7.50 | 2.00 | 3.50 |

View File

@ -0,0 +1,93 @@
import os
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import BBEHDataset, BBEHEvaluator, bbeh_mcq_postprocess, BBEHEvaluator_mcq
bbeh_reader_cfg = dict(input_columns=['input'], output_column='target')
bbeh_multiple_choice_sets = [
'bbeh_boolean_expressions',
'bbeh_disambiguation_qa',
'bbeh_geometric_shapes',
'bbeh_hyperbaton',
'bbeh_movie_recommendation',
'bbeh_nycc',
'bbeh_shuffled_objects',
]
bbeh_free_form_sets = [
'bbeh_boardgame_qa',
'bbeh_buggy_tables',
'bbeh_causal_understanding',
'bbeh_dyck_languages',
'bbeh_linguini',
'bbeh_multistep_arithmetic',
'bbeh_object_counting',
'bbeh_object_properties',
'bbeh_sarc_triples',
'bbeh_spatial_reasoning',
'bbeh_sportqa',
'bbeh_temporal_sequence',
'bbeh_time_arithmetic',
'bbeh_web_of_lies',
'bbeh_word_sorting',
'bbeh_zebra_puzzles',
]
bbeh_datasets = []
for _name in bbeh_multiple_choice_sets:
bbeh_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt=
f"Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\"without any modification, and provide the answer directly, with no formatting, no bolding, and no markup. For instance: \"The answer is: 42\" or \"The answer is: yes\". If the question is multiple choice with a single correct answer, the final answer must only be the letter corresponding to the correct answer. For example, \"The answer is: (a)\"\n\nQ: {{input}}\nA: "
)
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=8192))
bbeh_eval_cfg = dict(
evaluator=dict(type=BBEHEvaluator_mcq),
pred_role='BOT',
pred_postprocessor=dict(type=bbeh_mcq_postprocess),
dataset_postprocessor=dict(type=bbeh_mcq_postprocess))
bbeh_datasets.append(
dict(
type=BBEHDataset,
path='opencompass/bbeh',
name=_name,
abbr=_name,
reader_cfg=bbeh_reader_cfg,
infer_cfg=bbeh_infer_cfg.copy(),
eval_cfg=bbeh_eval_cfg.copy()))
for _name in bbeh_free_form_sets:
bbeh_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt=
f"Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\"without any modification, and provide the answer directly, with no formatting, no bolding, and no markup. For instance: \"The answer is: 42\" or \"The answer is: yes\". If the question is multiple choice with a single correct answer, the final answer must only be the letter corresponding to the correct answer. For example, \"The answer is: (a)\"\n\nQ: {{input}}\nA: "
)
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=8192))
    bbeh_eval_cfg = dict(
        evaluator=dict(type=BBEHEvaluator),
        pred_role='BOT',
        pred_postprocessor=dict(type=bbeh_mcq_postprocess),
        dataset_postprocessor=dict(type=bbeh_mcq_postprocess))
bbeh_datasets.append(
dict(
type=BBEHDataset,
path='opencompass/bbeh',
name=_name,
abbr=_name,
reader_cfg=bbeh_reader_cfg,
infer_cfg=bbeh_infer_cfg.copy(),
eval_cfg=bbeh_eval_cfg.copy()))

View File

@ -0,0 +1,126 @@
import os
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import (
BBEHDataset,
generic_llmjudge_postprocess,
)
from opencompass.evaluator import GenericLLMEvaluator
bbeh_reader_cfg = dict(input_columns=['input'], output_column='target')
bbeh_multiple_choice_sets = [
'bbeh_boolean_expressions',
'bbeh_disambiguation_qa',
'bbeh_geometric_shapes',
'bbeh_hyperbaton',
'bbeh_movie_recommendation',
'bbeh_nycc',
'bbeh_shuffled_objects',
]
bbeh_free_form_sets = [
'bbeh_boardgame_qa',
'bbeh_buggy_tables',
'bbeh_causal_understanding',
'bbeh_dyck_languages',
'bbeh_linguini',
'bbeh_multistep_arithmetic',
'bbeh_object_counting',
'bbeh_object_properties',
'bbeh_sarc_triples',
'bbeh_spatial_reasoning',
'bbeh_sportqa',
'bbeh_temporal_sequence',
'bbeh_time_arithmetic',
'bbeh_web_of_lies',
'bbeh_word_sorting',
'bbeh_zebra_puzzles',
]
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{input}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
bbeh_datasets = []
for _name in bbeh_multiple_choice_sets + bbeh_free_form_sets:
bbeh_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt=f"Think step by step, and when you provide the final answer, please use the prefix \"The answer is:\"without any modification, and provide the answer directly, with no formatting, no bolding, and no markup. For instance: \"The answer is: 42\" or \"The answer is: yes\". If the question is multiple choice with a single correct answer, the final answer must only be the letter corresponding to the correct answer. For example, \"The answer is: (a)\"\n\nQ: {{input}}\nA: ",
)
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
bbeh_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=BBEHDataset,
path='opencompass/bbeh',
name=_name,
abbr=_name,
reader_cfg=bbeh_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
)
bbeh_datasets.append(
dict(
type=BBEHDataset,
path='opencompass/bbeh',
name=_name,
abbr=_name,
reader_cfg=bbeh_reader_cfg,
infer_cfg=bbeh_infer_cfg,
eval_cfg=bbeh_eval_cfg,
)
)

View File

@ -0,0 +1,185 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import CMMLUDataset
from opencompass.utils.text_postprocessors import match_answer_pattern
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
cmmlu_subject_mapping = {
'agronomy': '农学',
'anatomy': '解剖学',
'ancient_chinese': '古汉语',
'arts': '艺术学',
'astronomy': '天文学',
'business_ethics': '商业伦理',
'chinese_civil_service_exam': '中国公务员考试',
'chinese_driving_rule': '中国驾驶规则',
'chinese_food_culture': '中国饮食文化',
'chinese_foreign_policy': '中国外交政策',
'chinese_history': '中国历史',
'chinese_literature': '中国文学',
'chinese_teacher_qualification': '中国教师资格',
'clinical_knowledge': '临床知识',
'college_actuarial_science': '大学精算学',
'college_education': '大学教育学',
'college_engineering_hydrology': '大学工程水文学',
'college_law': '大学法律',
'college_mathematics': '大学数学',
'college_medical_statistics': '大学医学统计',
'college_medicine': '大学医学',
'computer_science': '计算机科学',
'computer_security': '计算机安全',
'conceptual_physics': '概念物理学',
'construction_project_management': '建设工程管理',
'economics': '经济学',
'education': '教育学',
'electrical_engineering': '电气工程',
'elementary_chinese': '小学语文',
'elementary_commonsense': '小学常识',
'elementary_information_and_technology': '小学信息技术',
'elementary_mathematics': '初等数学',
'ethnology': '民族学',
'food_science': '食品科学',
'genetics': '遗传学',
'global_facts': '全球事实',
'high_school_biology': '高中生物',
'high_school_chemistry': '高中化学',
'high_school_geography': '高中地理',
'high_school_mathematics': '高中数学',
'high_school_physics': '高中物理学',
'high_school_politics': '高中政治',
'human_sexuality': '人类性行为',
'international_law': '国际法学',
'journalism': '新闻学',
'jurisprudence': '法理学',
'legal_and_moral_basis': '法律与道德基础',
'logical': '逻辑学',
'machine_learning': '机器学习',
'management': '管理学',
'marketing': '市场营销',
'marxist_theory': '马克思主义理论',
'modern_chinese': '现代汉语',
'nutrition': '营养学',
'philosophy': '哲学',
'professional_accounting': '专业会计',
'professional_law': '专业法学',
'professional_medicine': '专业医学',
'professional_psychology': '专业心理学',
'public_relations': '公共关系',
'security_study': '安全研究',
'sociology': '社会学',
'sports_science': '体育学',
'traditional_chinese_medicine': '中医中药',
'virology': '病毒学',
'world_history': '世界历史',
'world_religions': '世界宗教',
}
QUERY_TEMPLATE = """
你回答的最后一行**必须**是以下格式 '答案: $选项' (不带引号), 其中选项是ABCD之一.
{question}
A) {A}
B) {B}
C) {C}
D) {D}
""".strip()
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n {question}\n A) {A}\n B) {B}\n C) {C}\n D) {D}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
cmmlu_all_sets = list(cmmlu_subject_mapping.keys())
cmmlu_datasets = []
for _name in cmmlu_all_sets:
_ch_name = cmmlu_subject_mapping[_name]
prompt_prefix = f'请回答以下关于{_ch_name}的单项选择题, '
cmmlu_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=prompt_prefix + QUERY_TEMPLATE),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
cmmlu_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CMMLUDataset,
path='opencompass/cmmlu',
name=_name,
reader_cfg=dict(
input_columns=['question', 'A', 'B', 'C', 'D'],
output_column='answer',
train_split='dev',
test_split='test',
),
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
)
cmmlu_datasets.append(
dict(
type=CMMLUDataset,
path='opencompass/cmmlu',
name=_name,
abbr=f'cmmlu-{_name}',
reader_cfg=dict(
input_columns=['question', 'A', 'B', 'C', 'D'],
output_column='answer',
train_split='dev',
test_split='test',
),
infer_cfg=cmmlu_infer_cfg,
eval_cfg=cmmlu_eval_cfg,
mode='singlescore',
)
)
del _name, _ch_name

View File

@ -0,0 +1,89 @@
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import DropOpenAIDataset
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
with read_base():
from .drop_examples import drop_examples # noqa: F401, F403
drop_reader_cfg = dict(
input_columns=['prompt'],
output_column='answers',
train_split='validation',
test_split='validation',
)
template = f'You will be asked to read a passage and answer a question. Some examples of passages and Q&A are provided below.\n\n{drop_examples}\n\n# Your Task\n\n---\n{{prompt}}\n\nThink step by step, then write a line of the form "Answer: $ANSWER" at the end of your response.'
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {prompt}\n \n<Original Question End>\n\n
<Gold Target Begin>: \n{answers}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
drop_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[dict(role='HUMAN', prompt=template)]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
drop_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=DropOpenAIDataset,
path='data/drop_simple_eval/dev.jsonl',
reader_cfg=drop_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
)
drop_datasets = [
dict(
abbr='drop',
type=DropOpenAIDataset,
path='data/drop_simple_eval/dev.jsonl',
reader_cfg=drop_reader_cfg,
infer_cfg=drop_infer_cfg,
eval_cfg=drop_eval_cfg,
)
]

View File

@ -0,0 +1,97 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccwithDetailsEvaluator
from opencompass.datasets import HellaswagDatasetwithICE
from opencompass.utils.text_postprocessors import first_option_postprocess
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
hellaswag_reader_cfg = dict(
input_columns=['ctx', 'A', 'B', 'C', 'D'],
output_column='label',
train_split='train',
test_split='val',
)
align_prompt = """Continue the following text without adding any additional information or formatting:
{ctx}
A) {A}
B) {B}
C) {C}
D) {D}
What is the right option?"""
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {ctx}\n A) {A}\n B) {B}\n C) {C}\n D) {D}\n<Original Question End>\n\n
<Gold Target Begin>: \n{label}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
hellaswag_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=align_prompt),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
hellaswag_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=HellaswagDatasetwithICE,
path='opencompass/hellaswag_ice',
reader_cfg=hellaswag_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
hellaswag_datasets = [
dict(
abbr='hellaswag',
type=HellaswagDatasetwithICE,
path='opencompass/hellaswag_ice',
reader_cfg=hellaswag_reader_cfg,
infer_cfg=hellaswag_infer_cfg,
eval_cfg=hellaswag_eval_cfg,
)
]

View File

@ -0,0 +1,111 @@
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import MMLUDataset
from opencompass.utils.text_postprocessors import match_answer_pattern
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
with read_base():
from .mmlu_all_sets import mmlu_all_sets
# None of the MMLU datasets on Hugging Face are parsed correctly, so we use our own dataset reader
# Please download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar
QUERY_TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of ABCD.
{input}
A) {A}
B) {B}
C) {C}
D) {D}
""".strip()
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {input}\n A) {A}\n B) {B}\n C) {C}\n D) {D}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
mmlu_reader_cfg = dict(
input_columns=['input', 'A', 'B', 'C', 'D'],
output_column='target',
train_split='dev',
)
mmlu_datasets = []
for name in mmlu_all_sets:
mmlu_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=QUERY_TEMPLATE),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
mmlu_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=MMLUDataset,
path='opencompass/mmlu',
name=name,
reader_cfg=mmlu_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
)
mmlu_datasets.append(
dict(
abbr=f'lukaemon_mmlu_{name}',
type=MMLUDataset,
path='opencompass/mmlu',
name=name,
reader_cfg=mmlu_reader_cfg,
infer_cfg=mmlu_infer_cfg,
eval_cfg=mmlu_eval_cfg,
mode='singlescore',
)
)

View File

@ -0,0 +1,106 @@
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import MMLUProDataset, generic_llmjudge_postprocess
with read_base():
from .mmlu_pro_categories import categories
QUERY_TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of Options(e.g. one of ABCDEFGHIJKLMNOP). Think step by step before answering.
Question:\n
{question}
Options:\n
{options_str}
""".strip()
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {question}\n {options_str} \n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
mmlu_pro_datasets = []
for category in categories:
mmlu_pro_reader_cfg = dict(
input_columns=['question', 'cot_content', 'options_str'],
output_column='answer',
train_split='validation',
test_split='test',
)
mmlu_pro_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=QUERY_TEMPLATE),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
mmlu_pro_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=MMLUProDataset,
path='opencompass/mmlu_pro',
category=category,
reader_cfg=mmlu_pro_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
mmlu_pro_datasets.append(
dict(
abbr=f'mmlu_pro_{category.replace(" ", "_")}',
type=MMLUProDataset,
path='opencompass/mmlu_pro',
category=category,
reader_cfg=mmlu_pro_reader_cfg,
infer_cfg=mmlu_pro_infer_cfg,
eval_cfg=mmlu_pro_eval_cfg,
)
)

View File

@ -0,0 +1,56 @@
# Select the 10 most popular programming languages from MultiPL-E to compose the test set.
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MultiplEDataset, MultiplEEvaluator
_TOP_TEN_LANGUAGE_ = ['cpp', 'cs', 'go', 'java', 'rb', 'js', 'php', 'r', 'rs', 'sh']
multiple_reader_cfg = dict(input_columns=['language', 'prompt'], output_column='tests')
multiple_infer_cfg = dict(
prompt_template=dict(type=PromptTemplate, template='Based on the provided {language} code snippet, complete the subsequent content. The initial part of the completed code must match the provided code snippet exactly:\n{prompt}'),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
multiple_eval_cfg = {
lang: dict(
evaluator=dict(
type=MultiplEEvaluator,
language=lang,
ip_address='https://opencompass-multiple-evaluator.hf.space',
),
pred_role='BOT',
) for lang in _TOP_TEN_LANGUAGE_
}
multiple_datasets = [
dict(
type=MultiplEDataset,
abbr=f'humaneval-multiple-{lang}',
language=lang,
num_repeats=1,
path='opencompass/multipl_e',
tag='humaneval',
reader_cfg=multiple_reader_cfg,
infer_cfg=multiple_infer_cfg,
eval_cfg=multiple_eval_cfg[lang],
) for lang in _TOP_TEN_LANGUAGE_
]
multiple_datasets += [
dict(
type=MultiplEDataset,
abbr=f'mbpp-multiple-{lang}',
language=lang,
num_repeats=1,
path='opencompass/multipl_e',
tag='mbpp',
reader_cfg=multiple_reader_cfg,
infer_cfg=multiple_infer_cfg,
eval_cfg=multiple_eval_cfg[lang],
) for lang in _TOP_TEN_LANGUAGE_
]

View File

@ -0,0 +1,131 @@
from opencompass.datasets import MusrDataset, generic_llmjudge_postprocess
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.openicl import PromptTemplate, ZeroRetriever, GenInferencer
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {system_prompt}\n{prompt}\n<Original Question End>\n\n
<Gold Target Begin>: \n{gold_answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
# Common configuration components
reader_cfg = dict(
input_columns=[
'context',
'question_text',
'question',
'answer',
'choices',
'choices_str',
'intermediate_trees',
'intermediate_data',
'prompt',
'system_prompt',
'gold_answer',
'scidx',
'self_consistency_n',
'ablation_name',
],
output_column='gold_answer',
)
infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt='{system_prompt}',
)
],
round=[
dict(role='HUMAN', prompt='{prompt}'),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Dataset configurations
DATASET_CONFIGS = {
'murder_mysteries': {
'abbr': 'musr_murder_mysteries',
'name': 'murder_mysteries',
'path': 'opencompass/musr',
},
'object_placements': {
'abbr': 'musr_object_placements',
'name': 'object_placements',
'path': 'opencompass/musr',
},
'team_allocation': {
'abbr': 'musr_team_allocation',
'name': 'team_allocation',
'path': 'opencompass/musr',
},
}
# Create dataset configurations
musr_datasets = []
for config in DATASET_CONFIGS.values():
dataset = dict(
abbr=config['abbr'],
type=MusrDataset,
path=config['path'],
name=config['name'],
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=MusrDataset,
path=config['path'],
name=config['name'],
reader_cfg=reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
),
)
musr_datasets.append(dataset)

View File

@ -0,0 +1,57 @@
from opencompass.datasets.supergpqa.supergpqa import (
SuperGPQADataset,
SuperGPQAEvaluator,
)
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
# Reader configuration
reader_cfg = dict(
input_columns=[
'question',
'options',
'discipline',
'field',
'subfield',
'difficulty',
'infer_prompt',
'prompt_mode',
],
output_column='answer_letter',
)
# Inference configuration
infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{infer_prompt}',
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
eval_cfg = dict(
evaluator=dict(type=SuperGPQAEvaluator),
pred_role='BOT',
)
supergpqa_dataset = dict(
type=SuperGPQADataset,
abbr='supergpqa',
path='m-a-p/SuperGPQA',
prompt_mode='zero-shot',
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=eval_cfg,
)
supergpqa_datasets = [supergpqa_dataset]

View File

@ -0,0 +1,103 @@
from opencompass.datasets.supergpqa.supergpqa import (
SuperGPQADataset,
)
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {infer_prompt}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer_letter}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
# Reader configuration
reader_cfg = dict(
input_columns=[
'question',
'options',
'discipline',
'field',
'subfield',
'difficulty',
'infer_prompt',
'prompt_mode',
],
output_column='answer_letter',
)
# Inference configuration
infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{infer_prompt}',
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=SuperGPQADataset,
path='m-a-p/SuperGPQA',
prompt_mode='zero-shot',
reader_cfg=reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
supergpqa_dataset = dict(
type=SuperGPQADataset,
abbr='supergpqa',
path='m-a-p/SuperGPQA',
prompt_mode='zero-shot',
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=eval_cfg,
)
supergpqa_datasets = [supergpqa_dataset]

View File

@ -9,4 +9,4 @@ models = [
batch_size=8,
run_cfg=dict(num_gpus=2),
)
]
]

View File

@ -16,7 +16,17 @@ math_categories = [
'OE_TO_maths_zh_CEE', # OpenEnded - TextOnly - maths - CEE
]
physics_categories = [
'OE_TO_physics_en_COMP', # OpenEnded - TextOnly - physics - COMP
'OE_TO_physics_zh_CEE' # OpenEnded - TextOnly - physics - CEE
]
OlympiadBenchMath_summary_groups = [
{'name': 'OlympiadBenchMath', 'subsets': ['OlympiadBench_' + c.replace(' ', '_') for c in math_categories]},
]
OlympiadBenchPhysics_summary_groups = [
{'name': 'OlympiadBenchPhysics', 'subsets': ['OlympiadBench_' + c.replace(' ', '_') for c in physics_categories]},
]

View File

@ -0,0 +1,13 @@
bbeh_summary_groups = []
# bbeh
_bbeh = [
'bbeh_boolean_expressions', 'bbeh_disambiguation_qa', 'bbeh_geometric_shapes', 'bbeh_hyperbaton',
'bbeh_movie_recommendation', 'bbeh_nycc', 'bbeh_shuffled_objects', 'bbeh_boardgame_qa',
'bbeh_buggy_tables', 'bbeh_causal_understanding', 'bbeh_dyck_languages', 'bbeh_linguini',
'bbeh_multistep_arithmetic', 'bbeh_object_counting', 'bbeh_object_properties', 'bbeh_sarc_triples',
'bbeh_spatial_reasoning', 'bbeh_sportqa', 'bbeh_temporal_sequence', 'bbeh_time_arithmetic',
'bbeh_web_of_lies', 'bbeh_word_sorting', 'bbeh_zebra_puzzles'
]
bbeh_summary_groups.append({'name': 'bbeh', 'subsets': _bbeh, 'metric':'naive_average'})
bbeh_summary_groups.append({'name': 'bbeh', 'subsets': _bbeh, 'metric':'harmonic_mean'})
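# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# A summarizer config can surface either aggregate by pairing the group name
# with the metric it wants; the layout below follows the usual OpenCompass
# summarizer keys and is shown only as an assumption-labelled example.
summarizer = dict(
    summary_groups=bbeh_summary_groups,
    dataset_abbrs=[
        ['bbeh', 'naive_average'],
        ['bbeh', 'harmonic_mean'],
    ],
)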

View File

@ -9,6 +9,7 @@ from .arc import * # noqa: F401, F403
from .arc_prize_public_evaluation import * # noqa: F401, F403
from .ax import * # noqa: F401, F403
from .babilong import * # noqa: F401, F403
from .bbeh import * # noqa: F401, F403
from .bbh import * # noqa: F401, F403
from .bigcodebench import * # noqa: F401, F403
from .boolq import * # noqa: F401, F403
@ -97,6 +98,7 @@ from .mmlu_cf import * # noqa: F401, F403
from .mmlu_pro import * # noqa: F401, F403
from .MMLUArabic import * # noqa: F401, F403
from .mmmlu import * # noqa: F401, F403
from .multipl_e import * # noqa: F401, F403
from .multirc import * # noqa: F401, F403
from .musr import * # noqa: F401, F403
from .narrativeqa import * # noqa: F401, F403
@ -127,6 +129,7 @@ from .strategyqa import * # noqa: F401, F403
from .subjective import * # noqa: F401, F403
from .summedits import * # noqa: F401, F403
from .summscreen import * # noqa: F401, F403
from .supergpqa import * # noqa: F401, F403
from .svamp import * # noqa: F401, F403
from .tabmwp import * # noqa: F401, F403
from .taco import * # noqa: F401, F403

View File

@ -0,0 +1,149 @@
import json
import os.path as osp
import re
from os import environ
from datasets import Dataset
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import (ICL_EVALUATORS, LOAD_DATASET,
TEXT_POSTPROCESSORS)
from opencompass.utils import get_data_path
from .base import BaseDataset
@LOAD_DATASET.register_module()
class BBEHDataset(BaseDataset):
@staticmethod
def load(path: str, name: str):
path = get_data_path(path)
if environ.get('DATASET_SOURCE') == 'ModelScope':
from modelscope import MsDataset
dataset = MsDataset.load(path, subset_name=name, split='test')
else:
with open(osp.join(path, f'{name}/task.json'), 'r') as f:
data = json.load(f)['examples']
dataset = Dataset.from_list(data)
return dataset
@TEXT_POSTPROCESSORS.register_module('bbeh_freeform')
def bbeh_freeform_postprocess(text: str) -> str:
# Extract answer using specified prefixes
prefixes = [
'The answer is: ', 'The answer is ', 'The final answer is: ',
'The final answer is '
]
answer = text
for prefix in prefixes:
if prefix in text:
answer = text.split(prefix)[-1]
break
# Remove formatting markup
if '\\boxed' in answer:
answer = re.sub(r'\\boxed{(.*?)}', r'\1', answer) # latex box
if '\\text' in answer:
answer = re.sub(r'\\text(?:tt)?{(.*?)}', r'\1', answer) # text/texttt
if '**' in answer:
answer = re.sub(r'\*\*(.*?)\*\*', r'\1', answer) # bold
# Take first line and clean
if '\n' in answer:
answer = answer.split('\n')[0].strip()
return answer.strip().lower()
@TEXT_POSTPROCESSORS.register_module('bbeh_mcq')
def bbeh_mcq_postprocess(text: str) -> str:
# Extract answer using specified prefixes
prefixes = [
'The answer is: ', 'The answer is ', 'The final answer is: ',
'The final answer is '
]
answer = text
for prefix in prefixes:
if prefix in text:
answer = text.split(prefix)[-1]
break
# Remove parentheses if present
answer = answer.strip('()')
# Take first line and clean
if '\n' in answer:
answer = answer.split('\n')[0].strip()
return answer.strip().lower()
@ICL_EVALUATORS.register_module()
class BBEHEvaluator(BaseEvaluator):
def score(self, predictions, references):
if len(predictions) != len(references):
return {
'error': 'predictions and references have different length'
}
processed_preds = [bbeh_freeform_postprocess(p) for p in predictions]
# References are already in correct format
processed_refs = [r.lower() for r in references]
details = []
correct_count = 0
for pred, ref in zip(processed_preds, processed_refs):
correct = False
# Rule 1: Exact match
if pred == ref:
correct = True
# Rule 2: Match after removing quotes/brackets
elif pred == ref.strip("'\"()[]"):
correct = True
# Rule 4: Comma-separated answers
elif ',' in ref:
norm_pred = re.sub(r'\s*,\s*', ',', pred)
norm_ref = re.sub(r'\s*,\s*', ',', ref)
if norm_pred == norm_ref:
correct = True
details.append({'pred': pred, 'answer': ref, 'correct': correct})
correct_count += int(correct)
score = (correct_count / len(predictions)) * 100
return {'score': score, 'details': details}
@ICL_EVALUATORS.register_module()
class BBEHEvaluator_mcq(BaseEvaluator):
def score(self, predictions, references):
if len(predictions) != len(references):
return {
'error': 'predictions and references have different length'
}
processed_preds = [bbeh_mcq_postprocess(p) for p in predictions]
# References are already in correct format
processed_refs = [r.lower().strip('()') for r in references]
details = []
correct_count = 0
for pred, ref in zip(processed_preds, processed_refs):
correct = False
# Rule 1: Exact match
if pred == ref:
correct = True
details.append({'pred': pred, 'answer': ref, 'correct': correct})
correct_count += int(correct)
score = (correct_count / len(predictions)) * 100
return {'score': score, 'details': details}
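# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# Toy check of the post-processing and scoring above; the inputs are made up
# and simply exercise the prefix/markup-stripping rules.
if __name__ == '__main__':
    demo_preds = ['The answer is: **42**', 'The final answer is \\boxed{no}']
    demo_refs = ['42', 'yes']
    print(BBEHEvaluator().score(demo_preds, demo_refs))  # expected score: 50.0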

View File

@ -183,6 +183,33 @@ class CustomDataset(BaseDataset):
return Dataset.from_list(data)
@LOAD_DATASET.register_module()
class CodeCustomDataset(BaseDataset):
@staticmethod
def load(path, file_name=None, local_mode=False, num_repeats=1, **kwargs):
path = get_data_path(path, local_mode=local_mode)
if file_name is not None:
path = os.path.join(path, file_name)
data = []
if path.endswith('.jsonl'):
with open(path, 'r', encoding='utf-8') as f:
for line in f:
data.extend(
[json.loads(line.strip()) for _ in range(num_repeats)])
elif path.endswith('.csv'):
with open(path, 'r', encoding='utf-8-sig') as f:
reader = csv.reader(f)
header = next(reader)
for row in reader:
data.extend(
[dict(zip(header, row)) for _ in range(num_repeats)])
else:
raise ValueError(f'Unsupported file format: {path}')
return Dataset.from_list(data)
class CircularCustomDataset(CustomDataset, metaclass=CircularDatasetMeta):
dataset_class = CustomDataset
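# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# num_repeats in CodeCustomDataset.load() duplicates every record in place: a
# 2-line JSONL loaded with num_repeats=3 yields a 6-row Dataset in which each
# original record appears 3 consecutive times, which is what lets downstream
# pass@k evaluators slice predictions[i * num_repeats:(i + 1) * num_repeats].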

View File

@ -53,7 +53,7 @@ def compute_metrics_from_results(results, k_list=[1, 5]):
k: dict(zip(task_ids, v))
for k, v in detail_pass_at_k.items()
}
pass_at_k['detail'] = detail_metrics
pass_at_k['details'] = detail_metrics
return pass_at_k

View File

@ -0,0 +1,103 @@
import json
import os.path as osp
from datasets import Dataset
from opencompass.openicl.icl_evaluator.code_evaluator import CodeEvaluator
from opencompass.registry import LOAD_DATASET
from opencompass.utils import get_data_path
from .base import BaseDataset
# currently supporting languages
_HUMANEVAL_LANGUAGE_ = [
'adb', 'clj', 'cpp', 'cs', 'd', 'dart', 'elixir', 'go', 'hs', 'java', 'jl',
'js', 'lua', 'ml', 'php', 'pl', 'py', 'r', 'rb', 'rkt', 'rs', 'scala',
'sh', 'swift', 'ts'
]
_MBPP_LANGUAGE_ = [
'adb', 'clj', 'cpp', 'cs', 'd', 'elixir', 'go', 'hs', 'java', 'jl', 'js',
'lua', 'ml', 'php', 'pl', 'py', 'r', 'rb', 'rkt', 'rs', 'scala', 'sh',
'swift', 'ts'
]
@LOAD_DATASET.register_module()
class MultiplEDataset(BaseDataset):
@staticmethod
def load(path: str,
language: str,
num_repeats: int = 1,
tag: str = 'humaneval',
local_mode: bool = False):
"""Load dataset for pass k mode.
Args:
path(str): The path to the dataset.
language(str): The language of the dataset.
num_repeats(int): Number of repetition for this dataset to get.
tag(str): The tag of the dataset.
local_mode(bool): Whether to load the dataset in local mode.
Returns:
Dataset: A PyTorch dataset.
"""
path = get_data_path(path, local_mode=local_mode)
assert tag in ['humaneval',
'mbpp'], 'tag must be in ["humaneval", "mbpp"]'
if tag == 'humaneval':
assert language in _HUMANEVAL_LANGUAGE_, (
f'language must be in {_HUMANEVAL_LANGUAGE_}')
else:
assert language in _MBPP_LANGUAGE_, (
f'language must be in {_MBPP_LANGUAGE_}')
file_path = osp.join(path, f'{tag}-{language}.jsonl')
dataset = []
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
dataset.extend(
[json.loads(line.strip()) for _ in range(num_repeats)])
return Dataset.from_list(dataset)
class MultiplEEvaluator(CodeEvaluator):
def _stop_at_stop_token(self, decoded_string, stop_tokens):
"""Produces the prefix of decoded_string that ends at the first
occurrence of a stop_token.
WARNING: the decoded_string *must not* include the prompt,
which may have stop tokens itself.
Args:
decoded_string: A string generated by the model.
stop_tokens: A list of strings, where each string is a stop token.
Returns:
The decoded_string, truncated at the first occurrence of a stop
token.
"""
min_stop_index = len(decoded_string)
for stop_token in stop_tokens:
stop_index = decoded_string.find(stop_token)
if stop_index != -1 and stop_index < min_stop_index:
min_stop_index = stop_index
return decoded_string[:min_stop_index]
def _process_completions(self, test_case, completions):
"""Process completions with a test case.
Args:
test_case: A test case.
completions: A list of completions.
Returns:
A list of processed completions.
"""
processed_completions = []
for comp in completions:
comp = self._extract_code(comp)
post_comp = self._remove_prefix(test_case['prompt'], comp)
post_comp = self._stop_at_stop_token(post_comp,
test_case['stop_tokens'])
processed_completions.append(post_comp)
return processed_completions
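# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# The truncation above keeps everything before the earliest stop token. A tiny
# self-contained illustration of the same rule (the evaluator itself is not
# constructed here because CodeEvaluator.__init__ opens a service connection):
def _demo_stop_at_stop_token(decoded_string, stop_tokens):
    min_stop_index = len(decoded_string)
    for stop_token in stop_tokens:
        stop_index = decoded_string.find(stop_token)
        if stop_index != -1 and stop_index < min_stop_index:
            min_stop_index = stop_index
    return decoded_string[:min_stop_index]


if __name__ == '__main__':
    # '    return x\n' is kept; everything from '\ndef' onwards is dropped.
    assert _demo_stop_at_stop_token('    return x\n\ndef g():\n    pass',
                                    ['\ndef', '\nclass']) == '    return x\n'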

View File

@ -0,0 +1,182 @@
import os
from datasets import Dataset, load_dataset
from opencompass.datasets.supergpqa.supergpqa_eval import (
extract_option_content, extract_option_labels)
from opencompass.datasets.supergpqa.supergpqa_utils import load_yaml
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import ICL_EVALUATORS, LOAD_DATASET
from ..base import BaseDataset
def _parse(item, template, prompt_mode):
prompt_format = [
item['question'] + '\n' + '\n'.join([
f'{chr(65+i)}) {option}'
for i, option in enumerate(item['options'])
])
]
item['infer_prompt'] = template['prompt_format'][0].format(*prompt_format)
item['prompt_mode'] = prompt_mode
return item
@LOAD_DATASET.register_module()
class SuperGPQADataset(BaseDataset):
@staticmethod
def load(path: str, prompt_mode: str, **kwargs):
dataset = load_dataset(path, split='train')
# get prompt template
template_path = None
if prompt_mode == 'zero-shot':
template_path = os.path.join(
os.path.dirname(__file__),
'supergpqa_dataset_config/prompt/zero-shot.yaml',
)
elif prompt_mode == 'five-shot':
template_path = os.path.join(
os.path.dirname(__file__),
'supergpqa_dataset_config/prompt/five-shot.yaml',
)
try:
template = load_yaml(template_path)
except FileNotFoundError:
print(f'[ERROR] Missing prompt template: {template_path}')
return Dataset.from_list([])
dataset = dataset.map(lambda item: _parse(item, template, prompt_mode))
return dataset
@ICL_EVALUATORS.register_module()
class SuperGPQAEvaluator(BaseEvaluator):
def __init__(self):
super().__init__()
def score(self, predictions, references, test_set):
mode = test_set[0]['prompt_mode']
acc = 0
count = 0
err = 0
miss = 0
acc_difficulty = {'hard': 0, 'middle': 0, 'easy': 0}
count_difficulty = {'hard': 0, 'middle': 0, 'easy': 0}
stats = {'discipline': {}, 'field': {}, 'subfield': {}}
details = []
for i, sample in enumerate(test_set):
sample['pred'] = prediction = predictions[i]
gold = references[i]
if mode == 'zero-shot':
predict = extract_option_labels(prediction, 'ABCDEFGHIJ')
if predict is None:
predict = extract_option_content(prediction,
sample['options'])
predict = (chr(sample['options'].index(predict) +
65) if predict else None)
sample['extracted_answer'] = predict
elif mode == 'five-shot':
response = prediction.split('Question:')[0]
predict = extract_option_labels(response, 'ABCDEFGHIJ')
if predict is None:
predict = extract_option_content(response,
sample['options'])
predict = (chr(sample['options'].index(predict) +
65) if predict else None)
if predict is None:
predict = extract_option_labels(prediction, 'ABCDEFGHIJ')
if predict is None:
predict = extract_option_content(
prediction, sample['options'])
predict = (chr(sample['options'].index(predict) +
65) if predict else None)
sample['extracted_answer'] = predict
discipline = sample.get('discipline', 'unknown')
field = sample.get('field', 'unknown')
subfield = sample.get('subfield', 'unknown')
difficulty = sample.get('difficulty', 'unknown')
for level, key in [
('discipline', discipline),
# ('field', f"{discipline}/{field}"),
# ('subfield', f"{discipline}/{field}/{subfield}"),
]:
if key not in stats[level]:
stats[level][key] = {
'correct': 0,
'total': 0,
'miss': 0,
'error': 0,
'discipline': discipline,
'field': field,
'subfield': subfield,
'difficulty': {
'easy': {
'correct': 0,
'total': 0
},
'middle': {
'correct': 0,
'total': 0
},
'hard': {
'correct': 0,
'total': 0
},
},
}
stats[level][key]['total'] += 1
stats[level][key]['difficulty'][difficulty]['total'] += 1
answer_letter = sample['answer_letter']
assert answer_letter == gold
if predict and answer_letter == predict:
acc += 1
acc_difficulty[difficulty] += 1
sample['status'] = 'correct'
stats[level][key]['correct'] += 1
stats[level][key]['difficulty'][difficulty]['correct'] += 1
elif predict is None or predict == '':
miss += 1
sample['status'] = 'miss'
stats[level][key]['miss'] += 1
elif predict == 'error':
err += 1
sample['status'] = 'error'
stats[level][key]['error'] += 1
else:
sample['status'] = 'incorrect'
count += 1
count_difficulty[difficulty] += 1
details.append({
'pred': sample['pred'],
'answer': sample['answer'],
'parsed_answer': sample['extracted_answer'],
'correct': sample['status'] == 'correct',
})
return {
'accuracy':
acc / count if count > 0 else 0,
'error_rate':
err / count if count > 0 else 0,
'miss_rate':
miss / count if count > 0 else 0,
'hard_accuracy':
(acc_difficulty['hard'] /
count_difficulty['hard'] if count_difficulty['hard'] > 0 else 0),
'middle_accuracy':
(acc_difficulty['middle'] / count_difficulty['middle']
if count_difficulty['middle'] > 0 else 0),
'easy_accuracy':
(acc_difficulty['easy'] /
count_difficulty['easy'] if count_difficulty['easy'] > 0 else 0),
'details':
details,
}

View File

@ -0,0 +1,17 @@
response_key: 'response'
error_key: 'error'
id_key:
- 'uuid'
prompt_key: 'prompt'
history_key: 'history'
status_key: 'status'
save_prompt: True
max_tokens: 4096
temperature: 0.0
max_rounds: 30
BoN: 32

View File

@ -0,0 +1,17 @@
response_key: 'response'
error_key: 'error'
id_key:
- 'uuid'
prompt_key: 'prompt'
history_key: 'history'
status_key: 'status'
save_prompt: True
max_tokens: 32768
temperature: 0.0
max_rounds: 30
BoN: 32

View File

@ -0,0 +1,88 @@
import yaml
class ConfigWrapper:
def __init__(self, config_path):
self._config = {}
with open(config_path, 'r') as file:
self._config = yaml.safe_load(file)
for key, value in self._config.items():
setattr(self, key, value)
def __setattr__(self, key, value):
if key.startswith('_'):
super().__setattr__(key, value)
else:
self._config[key] = value
super().__setattr__(key, value)
def __getattr__(self, key):
if key in self._config:
return self._config[key]
raise AttributeError(
f"'ConfigWrapper' object has no attribute '{key}'")
def get_id(self, data):
if isinstance(self._config.get('id_key'), str):
return data.get(self._config.get('id_key'), None)
elif isinstance(self._config.get('id_key'), list):
return '_'.join([
str(data[key]) for key in self._config.get('id_key')
if key in data
])
def print_all_keys(self):
print('config keys:')
for key, value in self._config.items():
print(f' - {key}: {value}')
config_wrapper = None
def initialize_config(config_path):
global config_wrapper
config_wrapper = ConfigWrapper(config_path)
def get_config_wrapper():
global config_wrapper
if config_wrapper is None:
raise RuntimeError(
'ConfigWrapper not initialized. Call initialize_config first.')
return config_wrapper
if __name__ == '__main__':
config_path = 'config/config.yaml'
initialize_config(config_path)
data = {
'idx':
'50',
'step':
21,
'question':
'Ciphertext: "17,156,4,54,213,17,23,84,228,54,281"\n\n'
'Please provide the decrypted answer, encapsulated in double square'
' brackets. For example, the format should be: [[decrypted answer]].',
'answer':
'[[P]]',
'category':
'Decryption',
'rule_id':
'23',
'input':
'Ciphertext: "17,156,4,54,213,17,23,84,228,54,281"',
'steps_num':
23,
'description':
'For a number c=228 in the ciphertext:\n'
'Calculate z = c^e mod n. Here ^ means multiplication.\nz is 80.'
'\nBased on the decimal number represented by z, use the ascii '
'code to find the corresponding letter as the plaintext letter p.'
'\nPlease give the letter p in [[...]] format.\n',
'atom':
80,
}
print(config_wrapper.get_id(data))

View File

@ -0,0 +1,91 @@
prompt_format:
- |
Answer the following multiple choice question. There is only one correct answer. The last line of your response should be in the format 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
Question:
A refracting telescope consists of two converging lenses separated by 100 cm. The eye-piece lens has a focal length of 20 cm. The angular magnification of the telescope is
A) 10
B) 40
C) 6
D) 25
E) 15
F) 50
G) 30
H) 4
I) 5
J) 20
Answer: Let's think step by step. In a refracting telescope, if both lenses are converging, the focus of both lenses must be between the two lenses, and thus the focal lengths of the two lenses must add up to their separation. Since the focal length of one lens is 20 cm, the focal length of the other must be 80 cm. The magnification is the ratio of these two focal lengths, or 4.
Answer: H.
Question:
Say the pupil of your eye has a diameter of 5 mm and you have a telescope with an aperture of 50 cm. How much more light can the telescope gather than your eye?
A) 1000 times more
B) 50 times more
C) 5000 times more
D) 500 times more
E) 10000 times more
F) 20000 times more
G) 2000 times more
H) 100 times more
I) 10 times more
J) N/A
Answer: Let's think step by step. The amount of light a telescope can gather compared to the human eye is proportional to the area of its apertures. The area of a circle is given by the formula $A = \pi \left(\frac{{D}}{{2}}\right)^2$, where $D$ is the diameter. Therefore, the relative light-gathering power is calculated as:
\[
\frac{{\left(\frac{{50 \text{{ cm}}}}{{2}}\right)^2}}{{\left(\frac{{5 \text{{ mm}}}}{{2}}\right)^2}} = \frac{{\left(\frac{{50 \text{{ cm}}}}{{0.1 \text{{ cm}}}}\right)^2}}{{\left(\frac{{5 \text{{ mm}}}}{{0.1 \text{{ cm}}}}\right)^2}} = \frac{{500^2}}{{5^2}} = 10000.
\]
Answer: E.
Question:
Where do most short-period comets come from and how do we know?
A) The Kuiper belt; short period comets tend to be in the plane of the solar system like the Kuiper belt.
B) The asteroid belt; short period comets tend to come from random directions indicating a spherical distribution of comets called the asteroid belt.
C) The asteroid belt; short period comets tend to be in the plane of the solar system just like the asteroid belt.
D) The Oort cloud; short period comets have orbital periods similar to asteroids like Vesta and are found in the plane of the solar system just like the Oort cloud.
E) The Oort Cloud; short period comets tend to come from random directions indicating a spherical distribution of comets called the Oort Cloud.
F) The Oort cloud; short period comets tend to be in the plane of the solar system just like the Oort cloud.
G) The asteroid belt; short period comets have orbital periods similar to asteroids like Vesta and are found in the plane of the solar system just like the asteroid belt.
Answer: Let's think step by step. Most short-period comets originate from the Kuiper belt. This is deduced from the observation that these comets tend to follow orbits that lie in the plane of the solar system, similar to the distribution of objects in the Kuiper belt itself. Thus, the alignment of these cometary orbits with the ecliptic plane points to their Kuiper belt origin.
Answer: A.
Question:
Colors in a soap bubble result from light
A) dispersion
B) deflection
C) refraction
D) reflection
E) interference
F) converted to a different frequency
G) polarization
H) absorption
I) diffraction
J) transmission
Answer: Let's think step by step. The colorful patterns observed in a soap bubble are caused by the phenomenon of light interference. This occurs when light waves bounce between the two surfaces of the soap film, combining constructively or destructively based on their phase differences and the varying thickness of the film. These interactions result in vibrant color patterns due to variations in the intensity of different wavelengths of light.
Answer: E.
Question:
A microwave oven is connected to an outlet, 120 V, and draws a current of 2 amps. At what rate is energy being used by the microwave oven?
A) 240 W
B) 120 W
C) 10 W
D) 480 W
E) 360 W
F) 200 W
G) 30 W
H) 150 W
I) 60 W
J) 300 W
Answer: Let's think step by step. The rate of energy usage, known as power, in an electrical circuit is calculated by the product of voltage and current. For a microwave oven connected to a 120 V outlet and drawing a current of 2 amps, the power consumption can be calculated as follows:
\[
\text{{Power}} = \text{{Voltage}} \times \text{{Current}} = 120 \, \text{{V}} \times 2 \, \text{{A}} = 240 \, \text{{W}}.
\]
Therefore, the microwave oven uses energy at a rate of 240 watts.
Answer: A.
Question:
{}
Answer: Let's think step by step.

View File

@ -0,0 +1,23 @@
initial_prompt_0:
- |
Answer the following multiple choice question. There is only one correct answer. The last line of your response should be in the format 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
{}
initial_prompt_1:
- |
You are a helpful assistant. Answer the given multiple-choice question. Only one option is correct. The last line of your response should be in the format 'The correct answer is: $LETTER', where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
{}
initial_prompt_2:
- |
Select the correct answer for the following multiple-choice question. There is only one valid choice. The last line of your response should be in the format 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
{}
initial_prompt_3:
- |
Review the following multiple-choice question and choose the one correct answer. Ensure that your response concludes with a line exactly formatted as 'The correct answer is: $LETTER', where LETTER represents one of A, B, C, D, E, F, G, H, I, or J.
{}

View File

@ -0,0 +1,5 @@
prompt_format:
- |
Answer the following multiple choice question about {}. There is only one correct answer. The last line of your response should be in the format 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
{}

View File

@ -0,0 +1,5 @@
prompt_format:
- |
Answer the following multiple choice question. There is only one correct answer. The last line of your response should be in the format 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, D, E, F, G, H, I, or J.
{}

View File

@ -0,0 +1,96 @@
# flake8: noqa: W605
import re
import timeout_decorator
@timeout_decorator.timeout(5) # 5 seconds timeout
def safe_regex_search(pattern, text, flags=0):
try:
return re.search(pattern, text, flags)
except timeout_decorator.TimeoutError:
print(f'Regex match timeout: pattern={pattern}, text={text[:100]}...')
return None
except Exception as e:
print(f'Regex match error: {str(e)}')
return None
def extract_option_labels(text, options='ABCDEFGHIJ'):
if not isinstance(text, str) or not isinstance(options, str):
return 'error'
text = text.rstrip()
last_line = text.split('\n')[-1]
option_str = ''.join([chr(65 + i) for i in range(len(options))
]) if options else 'ABCDEFGHIJ'
patterns = [
# e.g. "The final answer to this question is: A."
# "The best option is $\boxed{B}:"
# "The correct answer is (C)."
f'[Tt]he\s+(?:\w+\s+)?(?:answer|option)(?:\w+\s+)?\s+is?:?\s*(?:[\*\$\\{{(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*([{option_str}])(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
# e.g. "ANSWER: A"
# "Answer: $\boxed{B}."
# "ANSWER: (C):"
f'(?i:Answer)[\*\s]*:\s*(?:[\*\$\\{{(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*([{option_str}])(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
# e.g. "A"
# "$\boxed{B}$"
# "(C)."
# "[D]:"
f'^[^\w\r\n]*(?:[\*\$\\{{(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*([{option_str}])(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
]
for pattern in patterns:
match = safe_regex_search(pattern, last_line, re.IGNORECASE)
if match:
return match.group(1)
for pattern in patterns:
match = safe_regex_search(pattern, text, re.IGNORECASE)
if match:
return match.group(1)
return None
def extract_option_content(text, options_content=None):
if not isinstance(text, str) or not isinstance(options_content, list):
return 'error'
escaped_options_content = [
re.escape(option_content) for option_content in options_content
]
escaped_options_content_str = '|'.join(escaped_options_content)
text = text.rstrip()
last_line = text.split('\n')[-1]
patterns = [
f'[Tt]he\s+(?:\w+\s+)?(?:answer|option)(?:\w+\s+)?\s+is:?\s*(?:[\*\$\\{{\(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*({escaped_options_content_str})(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
f'(?i:Answer)\s*(?:[\*\$\\{{\(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*({escaped_options_content_str})(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
f'^[^\w\r\n]*(?:[\*\$\\{{\(\[\\\\(]*?(?:(?:\\\\boxed|\\\\mathbf|\\\\mathrm|\\\\text){{)?)*\s*({escaped_options_content_str})(?:\\\\?\}}?\$?\)?\]?\}}?)*(?:[\s:\.\*)]|$)',
]
for pattern in patterns:
match = safe_regex_search(pattern, last_line)
if match:
if match.group(1) in escaped_options_content:
return options_content[escaped_options_content.index(
match.group(1))]
else:
return match.group(1)
for pattern in patterns:
match = safe_regex_search(pattern, text)
if match:
if match.group(1) in escaped_options_content:
return options_content[escaped_options_content.index(
match.group(1))]
else:
return match.group(1)
return None
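# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# Toy illustration of the two extraction helpers above; the inputs are made up
# and the expected outputs follow from the regexes as written.
if __name__ == '__main__':
    print(extract_option_labels('Some reasoning...\nAnswer: C'))  # -> 'C'
    print(extract_option_content('The correct answer is Paris.',
                                 ['London', 'Paris', 'Rome', 'Berlin']))
    # -> 'Paris'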

View File

@ -0,0 +1,693 @@
import json
import os
import re
import sympy as sp
import yaml
from sympy.parsing.latex import parse_latex
def load_yaml(yaml_path):
"""Load a YAML file."""
if not os.path.exists(yaml_path):
raise FileNotFoundError(f'YAML file not found: {yaml_path}')
with open(yaml_path, 'r', encoding='utf-8') as file:
return yaml.safe_load(file)
def load_json_or_jsonl(file_path):
"""Load data from a JSON or JSONL file."""
if not os.path.exists(file_path):
return None
with open(file_path, 'r', encoding='utf-8') as file:
if file_path.endswith('.json'):
return json.load(file)
elif file_path.endswith('.jsonl'):
return [json.loads(line) for line in file]
return None
def find_file(base_path, sub_path, extensions=('json', 'jsonl')):
"""Find the first available file with given extensions."""
for ext in extensions:
file_path = os.path.join(base_path, f'{sub_path}.{ext}')
if os.path.exists(file_path):
return file_path
return None
def load_json_or_jsonl_with_idx(data_path, split='', idx=None):
base_path = os.path.join(data_path, split)
if os.path.exists(f'{base_path}.json'):
file_path = f'{base_path}.json'
elif os.path.exists(f'{base_path}.jsonl'):
file_path = f'{base_path}.jsonl'
elif base_path.endswith('.json') or base_path.endswith('.jsonl'):
file_path = base_path
else:
raise FileNotFoundError('No JSON or JSONL file found.')
with open(file_path, 'r', encoding='utf-8') as file:
if file_path.endswith('.json'):
data = json.load(file)
elif file_path.endswith('.jsonl'):
data = [json.loads(line) for line in file]
if idx is not None:
try:
return next(item for item in data if item.get('idx') == idx)
except StopIteration:
raise ValueError(f'No entry found for idx {idx}')
else:
return data
def load_split_data(base_path, split_name):
"""Load the rule and sample data for a specific split."""
split_path = os.path.join(base_path, split_name)
rule_path = find_file(split_path, 'rule')
sample_path = find_file(split_path, 'sample')
rules = load_json_or_jsonl(rule_path) if rule_path else []
samples = load_json_or_jsonl(sample_path) if sample_path else []
return {'rules': rules, 'samples': samples}
def process_mixed_data(base_path, mode):
"""Load and process data for the 'mixed' split and specific mode."""
mixed_path = os.path.join(base_path, 'mixed')
file_path = find_file(mixed_path, mode)
if not file_path:
print(f'[WARNING] Missing file for mixed mode: {mode}')
return []
data = load_json_or_jsonl(file_path)
template_path = os.path.join(base_path, 'config/prompt/mixed.yaml')
template = load_yaml(template_path)
processed = []
for item in data:
rules = '\n'.join(item.get('rule_list', []))
questions = '\n'.join(item.get('question_list', []))
item['prompt'] = template['prompt_format'][0].format(rules, questions)
processed.append(item)
return processed
class ConfigWrapper:
def __init__(self, config_path):
self._config = {}
with open(config_path, 'r') as file:
self._config = yaml.safe_load(file)
for key, value in self._config.items():
setattr(self, key, value)
def __setattr__(self, key, value):
if key.startswith('_'):
super().__setattr__(key, value)
else:
self._config[key] = value
super().__setattr__(key, value)
def __getattr__(self, key):
if key in self._config:
return self._config[key]
raise AttributeError(
f"'ConfigWrapper' object has no attribute '{key}'")
def get_id(self, data):
if isinstance(self._config.get('id_key'), str):
return data.get(self._config.get('id_key'), None)
elif isinstance(self._config.get('id_key'), list):
return '_'.join([
str(data[key]) for key in self._config.get('id_key')
if key in data
])
def print_all_keys(self):
print('config keys:')
for key, value in self._config.items():
print(f' - {key}: {value}')
config_wrapper = None
def initialize_config(config_path):
global config_wrapper
config_wrapper = ConfigWrapper(config_path)
def get_config_wrapper():
global config_wrapper
if config_wrapper is None:
raise RuntimeError(
'ConfigWrapper not initialized. Call initialize_config first.')
return config_wrapper
if __name__ == '__main__':
config_path = 'config/config.yaml'
initialize_config(config_path)
data = {
'idx':
'50',
'step':
21,
'question':
('Ciphertext: "17,156,4,54,213,17,23,84,228,54,281"\n\n'
'Please provide the decrypted answer, encapsulated in double '
'square brackets. '
'For example, the format should be: [[decrypted answer]].'),
'answer':
'[[P]]',
'category':
'Decryption',
'rule_id':
'23',
'input':
'Ciphertext: "17,156,4,54,213,17,23,84,228,54,281"',
'steps_num':
23,
'description':
('For a number c=228 in the ciphertext:\n'
'Calculate z = c^e mod n. Here ^ means multiplication.\n'
'z is 80.\nBased on the decimal number represented by z, '
'use the ascii code to find the corresponding letter '
'as the plaintext letter p.\n'
'Please give the letter p in [[...]] format.\n'),
'atom':
80
}
print(config_wrapper.get_id(data))
def read_yaml(config='default'):
if os.path.exists(f'config/prompt/{config}.yaml'):
yaml_file = f'config/prompt/{config}.yaml'
else:
yaml_file = config
with open(yaml_file, 'r') as yaml_file:
return yaml.safe_load(yaml_file)
def write_jsonl_lines(file, data):
config_wrapper = get_config_wrapper()
if config_wrapper.save_prompt:
json.dump(data, file, ensure_ascii=False)
else:
data.pop(config_wrapper.prompt_key)
json.dump(data, file, ensure_ascii=False)
file.write('\n')
file.flush()
def print_info(info):
print('-' * 100)
print('[INFO] model_name:', info['model_name'])
print('[INFO] splits:', info['splits'])
print('[INFO] modes:', info['modes'])
print('[INFO] output_dir:', info['output_dir'])
print('[INFO] Infer Limit:',
'No limit' if info['infer_limit'] is None else info['infer_limit'])
print('[INFO] Number of Workers:', info['num_workers'])
print('[INFO] Batch Size:', info['batch_size'])
print('[INFO] Use Accel:', info['use_accel'])
print('-' * 100)
def read_json_or_jsonl(data_path, split='', mapping_key=None):
base_path = os.path.join(data_path, split)
if os.path.exists(f'{base_path}.json'):
file_path = f'{base_path}.json'
elif os.path.exists(f'{base_path}.jsonl'):
file_path = f'{base_path}.jsonl'
elif base_path.endswith('.json') or base_path.endswith('.jsonl'):
file_path = base_path
else:
raise FileNotFoundError('No JSON or JSONL file found.')
with open(file_path, 'r') as file:
if file_path.endswith('.json'):
data = json.load(file)
elif file_path.endswith('.jsonl'):
data = [json.loads(line) for line in file]
if mapping_key:
return {
item[mapping_key]: item
for item in data if mapping_key in item
}
else:
return data
def read_json_or_jsonl_with_idx(data_path, split='', idx=None):
base_path = os.path.join(data_path, split)
if os.path.exists(f'{base_path}.json'):
file_path = f'{base_path}.json'
elif os.path.exists(f'{base_path}.jsonl'):
file_path = f'{base_path}.jsonl'
elif base_path.endswith('.json') or base_path.endswith('.jsonl'):
file_path = base_path
else:
raise FileNotFoundError('No JSON or JSONL file found.')
with open(file_path, 'r', encoding='utf-8') as file:
if file_path.endswith('.json'):
data = json.load(file)
elif file_path.endswith('.jsonl'):
data = [json.loads(line) for line in file]
if idx is not None:
try:
return next(item for item in data if item.get('idx') == idx)
except StopIteration:
raise ValueError(f'No entry found for idx {idx}')
else:
return data
idx_ranges = [
[18],
[73, 74, 77],
[94],
[115, 116, 117],
[121, 122, 123, 125],
[131, 132, 134, 135, 136],
[141, 143, 149],
list(range(145, 148)),
list(range(151, 157)),
[160, 161, 162],
[164, 165, 166],
[170],
[206, 209],
list(range(211, 216)),
[217, 218],
]
def clean_json_string(json_str):
json_str = re.sub(r'[\x00-\x1F\x7F]', '', json_str)
return json_str
def is_in_idx_ranges(idx, idx_ranges):
for range_list in idx_ranges:
if int(idx) in range_list:
return True
return False
def extract_json(text):
matches = re.findall(r'{.*}', text, re.DOTALL)
if matches:
json_str = matches[-1]
json_str = clean_json_string(json_str)
try:
data = json.loads(json_str)
return data
except json.JSONDecodeError as e:
print(f'Error decoding JSON: {e}')
return 'NULL'
return 'NULL'
def extract_all_responses_from_json(response_json):
results = []
for key, value in response_json.items():
results.append(str(value))
return results
def clean_latex(latex_expr):
if '=' in latex_expr:
latex_expr = latex_expr.rsplit('=', 1)[1]
latex_expr = re.sub(r'\\[()\[\]]', '', latex_expr)
latex_expr = re.sub(r'\\text\{.*?\}', '', latex_expr)
latex_expr = re.sub(r'\\(left|right|displaystyle)', '', latex_expr)
latex_expr = latex_expr.replace('\\\\', '\\')
return latex_expr
def extract_text_from_brackets(text, clean_level='basic'):
matches = re.findall(r'\[\[\s*(.*?)\s*\]\]', text, re.DOTALL)
if not matches:
matches = re.findall(r'\$\\boxed\{(.*?)\}\$', text, re.DOTALL)
if not matches:
matches = re.findall(r'\[\s*(.*?)\s*\]', text, re.DOTALL)
if matches:
match_str = matches[0].strip()
if clean_level == 'clean':
match_str = match_str.replace('"', '').replace('\n', '').replace(
' ', '').replace('[', '').replace(']', '')
elif clean_level == 'logic':
match_str = match_str.replace('"', '').replace('\n', '').replace(
' ', '').replace('.', '')
elif clean_level == 'math':
match_str = match_str.replace('"', '').replace('\n', '').replace(
'[', '').replace(']', '').replace('$', '')
return f'{clean_latex(match_str)}'
return f'[[{match_str}]]'
return 'NULL'
def extract_inner_text_from_brackets(text):
if not isinstance(text, str):
print(f'text type: {type(text)}, text value: {text}')
return 'NULL'
match = re.search(r'\[\[(.*?)\]\]', text, re.DOTALL)
return match.group(1) if match else 'NULL'
def extract_numbers(str):
numbers = re.findall(r'\d+', str)
numbers = list(map(int, numbers))
return numbers
def extract_and_sort_inequalities(latex_expr):
pattern = r'(≥|≤)\s*([-]?\d+\.?\d*)'
matches = re.findall(pattern, latex_expr)
extracted_inequalities = [''.join(match) for match in matches]
sorted_inequalities = sorted(extracted_inequalities)
return sorted_inequalities
def rule5_normalize_content(content):
parts = [part for part in content.split(';')]
sorted_parts = sorted(parts)
return sorted_parts
def normalize_string(s):
s = re.sub(r'[^0-9]', '', s)
pairs = s.split(',')
pairs.sort()
return pairs
def remove_commas_and_spaces(s):
return re.sub(r'[,\s\[\]]+', '', s)
def remove_non_alphanumeric(s):
return re.sub(r'\W+', '', s)
def contains_or(answer):
return 'or' in answer
def compare_multi_results(response, answer):
try:
response_text = extract_text_from_brackets(response, 'clean')
response_text = re.sub(r'\\text\{or\}', 'or', response_text)
if response_text == 'NULL':
return False
answer = extract_text_from_brackets(answer, 'clean')
response_split = response_text.strip('[[]]').split('or')
answer_split = answer.strip('[[]]').split('or')
response_sorted = sorted([x.strip() for x in response_split])
answer_sorted = sorted([x.strip() for x in answer_split])
return response_sorted == answer_sorted
except Exception as e:
print(f'Error during comparison: {e}')
return False
def split_or_expression(expression):
return [part.strip() for part in expression.split('or')]
def compare_math_expressions(response, answer):
response_text = extract_text_from_brackets(response, 'math')
answer_text = extract_text_from_brackets(answer, 'math')
if response_text == 'NULL':
return False
if contains_or(answer_text):
response_parts = split_or_expression(response_text)
answer_parts = split_or_expression(answer_text)
try:
response_exprs = {
sp.simplify(parse_latex(part))
for part in response_parts
}
answer_exprs = {
sp.simplify(parse_latex(part))
for part in answer_parts
}
return response_exprs == answer_exprs
except Exception as e:
print(f'Error during simplification or parsing: {e}')
return response_text == answer_text
else:
try:
response_expr = sp.simplify(parse_latex(response_text))
answer_expr = sp.simplify(parse_latex(answer_text))
return response_expr == answer_expr
except Exception as e:
print(f'Error during simplification or parsing: {e}')
return response_text == answer_text
def method_equal(response_text, answer):
return response_text == answer
def method_1(response_text, answer):
cleaned_string = re.sub(r'[^A-Za-z]', '', response_text)
cleaned_string = cleaned_string.lower()
answer = re.sub(r'[^A-Za-z]', '', answer)
answer = answer.lower()
return cleaned_string == answer
def method_2(response_text, answer):
cleaned_string = re.sub(r'[^A-Za-z]', '', response_text)
cleaned_string = cleaned_string.lower()
answer = answer.split(',')
return cleaned_string in answer
def method_3(response_text, answer):
response_text = response_text.lower()
pairs1 = re.split(r'\W+', response_text)
pairs2 = answer.split(' ')
pairs1 = [word for word in pairs1 if word]
pairs1.sort()
pairs2.sort()
return pairs1 == pairs2
def method_4(response_text, answer):
cleaned_string = re.sub(r'[^A-Za-z]', '', response_text)
cleaned_string = cleaned_string.lower()
return cleaned_string in answer
def method_5(response_text, answer):
response_text = re.sub(r'\s+', '', response_text)
response_text = response_text.split(',')
answer = answer.split(',')
response_text.sort()
answer.sort()
return response_text == answer
def method_9(response_text, answer):
response_text = response_text.replace('×', '*').replace('−', '-')
answer = answer.replace('×', '*').replace('−', '-')
def extract_operators(s):
return re.findall(r'[+\-*/]', s)
response_ops = extract_operators(response_text.split('=')[0])
answer_ops = extract_operators(answer.split('=')[0])
if response_ops != answer_ops:
return False
match = re.search(r'=\s*(-?\d+)', answer)
expected_result = int(match.group(1))
try:
left_side = response_text.split('=')[0]
result = eval(left_side)
except Exception as e:
print(f'Error during evaluation: {e}')
return False
return result == expected_result
def method_10(response_text, answer):
response_text = response_text.replace('×', '*').replace('−', '-')
response_text = response_text.split('=')[0]
answer = answer.split('\n')[0].split('=')[0]
response_ops = sorted(remove_non_alphanumeric(response_text))
answer_ops = sorted(remove_non_alphanumeric(answer))
if response_ops != answer_ops:
return False
try:
result = eval(response_text)
except Exception as e:
print(f'Error during evaluation: {e}')
return False
return result == 24
def method_18(response_text, answer):
cleaned_s1 = remove_commas_and_spaces(response_text)
cleaned_s2 = remove_commas_and_spaces(answer)
return cleaned_s1 == cleaned_s2
def method_general(response_text, answer):
cleaned_s1 = remove_non_alphanumeric(response_text)
cleaned_s2 = remove_non_alphanumeric(answer)
return cleaned_s1 == cleaned_s2
question_methods = {
'1': method_1,
'2': method_2,
'3': method_3,
'4': method_4,
'5': method_5,
'9': method_9,
'10': method_10,
'18': method_18,
}
def evaluate_response_vs_answer(response, answer, question_type, rule_id, idx):
if question_type == 'logic' and rule_id == '5':
response_text = extract_text_from_brackets(response, 'logic')
answer_text = extract_text_from_brackets(answer, 'logic')
if response_text is None:
return False
normalized_response = rule5_normalize_content(response_text)
normalized_answer = rule5_normalize_content(answer)
return normalized_response == normalized_answer
elif question_type == 'logic':
response_text = extract_text_from_brackets(response, 'logic')
answer_text = extract_text_from_brackets(answer, 'logic')
return response_text == answer_text
elif question_type == 'operation' and (idx == '178' or idx == '179'):
response_text = extract_text_from_brackets(response, 'clean')
response_text = extract_and_sort_inequalities(response_text)
answer_text = extract_and_sort_inequalities(answer)
# print(response_text, answer_text)
return response_text == answer_text
elif question_type == 'operation' and rule_id == '18':
response_text = extract_text_from_brackets(response, 'clean')
answer = extract_inner_text_from_brackets(answer)
response_text = ''.join(sorted(re.sub(r'\W+', '', response_text)))
answer = ''.join(sorted(re.sub(r'\W+', '', answer)))
return response_text == answer
elif question_type == 'operation' and rule_id in {'23', '24', '25'}:
response_text = extract_text_from_brackets(response, 'clean')
if response_text is None:
return False
response_text = extract_numbers(response_text)
answer_text = extract_numbers(answer)
return response_text == answer_text
elif question_type == 'operation' and is_in_idx_ranges(idx, idx_ranges):
return compare_math_expressions(response, answer)
elif question_type == 'operation' and contains_or(answer):
return compare_multi_results(response, answer)
elif question_type == 'puzzle':
response_text = extract_inner_text_from_brackets(response)
answer = extract_inner_text_from_brackets(answer)
method = question_methods.get(rule_id)
if method:
return method(response_text, answer)
return method_general(response_text, answer)
else:
response_text = extract_text_from_brackets(response, 'clean')
return response_text == answer
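# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# Worked example of the dispatcher above (values are made up): for a 'logic'
# question with a rule other than '5', both sides are reduced by
# extract_text_from_brackets(..., 'logic'), so 'The result is [[ A, B ]]' and
# '[[A,B]]' both normalise to '[[A,B]]' and compare equal, i.e.
#   evaluate_response_vs_answer('The result is [[ A, B ]]', '[[A,B]]',
#                               'logic', '7', '1')  ->  True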
def compute_one_mixed_question_pass_rate(idx,
question_list,
response_json,
base_path=None):
if response_json == 'NULL':
result_dict = {
'idx': idx,
'response': response_json,
'details': None,
'pass_rate': 0,
'is_correct': False
}
return result_dict
response_list = extract_all_responses_from_json(response_json)
correct_num = 0
results = []
for q_idx, question in enumerate(question_list):
category, question_idx = question.rsplit('_', 1)
question_content = load_json_or_jsonl_with_idx(base_path,
os.path.join(
category, 'sample'),
idx=question_idx)
answer = question_content['answer']
if q_idx >= len(response_list):
break
response = response_list[q_idx]
response_text = extract_text_from_brackets(response)
rule_id = question_content['rule_id']
is_correct = evaluate_response_vs_answer(response, answer, category,
rule_id, q_idx)
if is_correct:
correct_num += 1
results.append({
'question': question,
'response_text': response_text,
'answer': answer,
'is_correct': is_correct
})
pass_rate = correct_num / len(question_list)
question_correct = pass_rate == 1.0
result_dict = {
'idx': idx,
'response': response_json,
'details': results,
'pass_rate': pass_rate,
'is_correct': question_correct
}
return result_dict
def evaluate_responses(data, mode, base_path=None):
results = []
# Iterate over the values of the dictionary (numerical keys)
for key, record in data.items():
idx = key # Use the dictionary key as the "idx"
response = record.get('prediction', '')
question_type = record.get('category', '')
response_text = extract_text_from_brackets(response)
answer = record.get('gold', '')
rule_id = record.get('rule_id', '')
is_correct = evaluate_response_vs_answer(response, answer,
question_type, rule_id, idx)
result_dict = {
'idx': idx,
'response': response,
'response_text': response_text,
'answer': answer,
'is_correct': is_correct
}
if question_type == 'counterfactual':
real_life_answer = record.get('real_life_answer', '')
is_real_life = evaluate_response_vs_answer(response,
real_life_answer,
question_type, rule_id,
idx)
result_dict['real_life_answer'] = real_life_answer
result_dict['is_real_life'] = is_real_life
if question_type == 'cipher' and mode == 'subquestions':
result_dict['type'] = record.get('type', '')
results.append(result_dict)
return results

View File

@ -1,3 +1,4 @@
import os
import os.path as osp
from typing import Dict, List, Optional
@ -36,7 +37,11 @@ class GenericLLMEvaluator(BaseEvaluator):
) -> None:
self.logger = get_logger()
self.judge_cfg = judge_cfg
# If judge_cfg is not provided, fall back to the default configuration
if not judge_cfg:
self.judge_cfg = self.default_judge_cfg
else:
self.judge_cfg = judge_cfg
self.output_path = ''
self.prompt_template = ICL_PROMPT_TEMPLATES.build(prompt_template)
@ -141,3 +146,30 @@ class GenericLLMEvaluator(BaseEvaluator):
kwargs = self.dict_postprocessor
proc = DICT_POSTPROCESSORS.get(kwargs.pop('type'))
return proc(output, self.output_path, **kwargs)
@property
def default_judge_cfg(self):
from opencompass.models import OpenAISDK
DEFAULT_JUDGE_CFG = dict(
type=OpenAISDK,
path=os.environ['OC_JUDGE_MODEL'],
key=os.environ['OC_JUDGE_API_KEY'],
openai_api_base=[
os.environ.get('OC_JUDGE_API_BASE',
'https://api.openai.com/v1/')
],
meta_template=dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
], ),
query_per_second=16,
batch_size=1024,
temperature=0.001,
tokenizer_path='gpt-4o-2024-05-13',
verbose=True,
max_out_len=16384,
max_seq_len=49152,
)
return DEFAULT_JUDGE_CFG
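# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# The fallback above reads its judge model from the environment; one minimal
# way to provide those values from Python before launching an eval (the model
# name and key below are placeholder assumptions, not project defaults):
import os

os.environ.setdefault('OC_JUDGE_MODEL', 'gpt-4o-2024-05-13')
os.environ.setdefault('OC_JUDGE_API_KEY', '<your-api-key>')
os.environ.setdefault('OC_JUDGE_API_BASE', 'https://api.openai.com/v1/')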

View File

@ -399,7 +399,7 @@ class OpenAI(BaseAPIModel):
self.logger.info(
f'Successfully load default tiktoken tokenizer: '
f' {default_tokenizer}')
return len(enc.encode(prompt))
return len(enc.encode(prompt, disallowed_special=()))
def _bin_trim(self, prompt: str, num_token: int, mode: str) -> str:
"""Get a suffix of prompt which is no longer than num_token tokens.

View File

@ -12,3 +12,4 @@ from .icl_misc_evaluator import AveragePPLEvaluator # noqa
from .icl_plugin_evaluator import TEvalEvaluator # noqa
from .icl_toxic_evaluator import ToxicEvaluator # noqa
from .lm_evaluator import LMEvaluator # noqa
from .math_evaluator import MATHEvaluator # noqa

View File

@ -0,0 +1,267 @@
# flake8: noqa: E501
import difflib
import os
import re
import tempfile
import time
from typing import Any, Dict, List, Optional, Tuple, Union
from datasets import Dataset
from gradio_client import Client
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import ICL_EVALUATORS
@ICL_EVALUATORS.register_module()
class CodeEvaluator(BaseEvaluator):
"""Evaluator for code generation tasks.
This evaluator sends code to a remote evaluation service to test its
functionality against provided test cases. It handles code extraction,
processing, and result analysis.
"""
def __init__(self,
language: str,
ip_address: str = 'localhost',
retry: int = 3) -> None:
"""Initialize the CodeEvaluator.
Args:
language (str): Programming language of the code to evaluate.
ip_address (str, optional): IP address of the evaluation service. Defaults to 'localhost'.
retry (int, optional): Number of retry attempts for failed connections. Defaults to 3.
"""
self.language = language
self.retry = retry
self.client = Client(ip_address)
super().__init__()
def _extract_code(self, text: str) -> str:
"""Extract code from markdown-formatted text.
Args:
text (str): Text that may contain code blocks in markdown format.
Returns:
str: Extracted code from the first code block, or the original text if no code blocks are found.
"""
blocks = re.findall(r'```\w*\n(.*?)```', text, re.DOTALL)
if len(blocks) >= 1:
text = blocks[0]
return text
def _code_eval_service(
self, input_data: Union[Dict, List,
str]) -> Tuple[bool, Union[Dict, List, Any]]:
"""Send code to the remote evaluation service using gradio_client and
get the results.
Args:
input_data: Can be one of:
- dict: Dictionary containing code information for a single test case
- list: List of dictionaries for batch evaluation
- str: File path to code file
Returns:
tuple: (succeed, output)
- succeed (bool): Whether the request was successful
- output (dict/list/str): Evaluation results or error message
"""
try:
temp_file_path = None
# Handle file path input
if isinstance(input_data, str):
with tempfile.NamedTemporaryFile(suffix=f'.{self.language}',
delete=False) as temp_file:
temp_file_path = temp_file.name
with open(input_data, 'r') as src_file:
content = src_file.read()
temp_file.write(content.encode())
input_data = temp_file_path
# Send to evaluation service
result = self.client.predict(input_data, api_name='/evaluate')
# Process the result
if isinstance(result, (dict, list)):
return True, result
else:
# Try to parse the result as JSON if it's a string
try:
import json
parsed_result = json.loads(result)
return True, parsed_result
except: # noqa: E722
return True, {'status': 'unknown', 'raw_result': result}
except Exception as e:
return False, str(e)
finally:
# Clean up temporary file if it was created
if temp_file_path and os.path.exists(temp_file_path):
try:
os.unlink(temp_file_path)
except: # noqa: E722
pass
def _remove_prefix(self,
prompt: str,
completion: str,
threshold: float = 0.95) -> str:
"""Determine the truncation point in the completion based on the last
line of the prompt, remove all content before that line in the
completion, and return the completion string after removing the prefix.
This is done to convert chatbot-style inference mode to completion
mode.
Args:
prompt (str): The prompt text.
completion (str): The completion text.
threshold (float): Line similarity threshold.
Returns:
str: The completion string after removing the prefix.
"""
prompt_lines = prompt.splitlines()
completion_lines = completion.splitlines()
if not prompt_lines:
return completion
last_prompt_line = prompt_lines[-1]
cut_index = -1
for i, completion_line in enumerate(completion_lines):
similarity = difflib.SequenceMatcher(None, last_prompt_line,
completion_line).ratio()
if similarity >= threshold:
cut_index = i
break
if cut_index != -1:
return '\n'.join(completion_lines[cut_index + 1:])
else:
return completion
def _process_completions(self, test_case: dict, completions: list) -> list:
"""Process code completion list, which typically involves extracting
code, removing repetitive prefixes caused by chatbot mode, and other
steps to ensure the model-generated code can be compiled successfully.
Args:
test_case (dict): Dictionary with test case information (e.g. name, language, prompt, tests).
completions (list): List of code completions generated by the model.
Returns:
list: Processed code completion list.
"""
processed_completions = []
for comp in completions:
comp = self._extract_code(comp)
post_comp = self._remove_prefix(test_case['prompt'], comp)
processed_completions.append(post_comp)
return processed_completions
def _evaluate(
self, input_data: Union[Dict, List]
) -> Tuple[bool, Optional[Union[Dict, List]], Optional[str]]:
"""Evaluate code with retry mechanism.
Args:
input_data: Can be either:
- dict: Dictionary containing code and test information for a single test case
- list: List of dictionaries for batch evaluation
Returns:
tuple: (success, output, error_message)
- success (bool): Whether the evaluation was successful
- output (dict or list): Evaluation output (if successful)
- error_message (str): Error message (if failed)
"""
num_retry = 0
while num_retry < self.retry:
succeed, output = self._code_eval_service(input_data)
if not succeed:
num_retry += 1
time.sleep(10)
else:
break
if not succeed:
return False, None, f'code eval service connection failed: {output}'
return True, output, None
def score(self, predictions: List, references: List,
test_set: Dataset) -> Dict:
"""Score code generation predictions against references.
Args:
predictions (list): List of model-generated code completions.
references (list): List of reference solutions (not directly used in evaluation).
test_set (Dataset): Dataset containing test cases and other metadata.
Returns:
dict: Evaluation results including:
- accuracy: Percentage of correctly solved problems
- details: Detailed results for each test case
- error: Error message if evaluation failed
"""
if len(predictions) != len(references):
return {
'error':
'predictions and references have different '
f'length. len(predictions): {len(predictions)}, '
f'len(references): {len(references)}'
}
test_set = test_set.to_pandas()
# Use the first column as the unique identifier
test_set_origin = test_set.drop_duplicates(subset=test_set.columns[0])
num_repeats = int(len(test_set) / len(test_set_origin))
# 1. Prepare data for all test cases
all_test_cases = []
for i in range(len(test_set_origin)):
test_case = test_set_origin.iloc[i]
completions = predictions[i * num_repeats:(i + 1) * num_repeats]
# Process code completions
processed_completions = self._process_completions(
test_case, completions)
result_dict = {
'name': test_case['name'],
'language': test_case['language'],
'prompt': test_case['prompt'],
'tests': test_case['tests'],
'processed_completions': processed_completions,
'completions': completions
}
all_test_cases.append(result_dict)
# 2. Send all test cases to the evaluation service
success, outputs, error_message = self._evaluate(all_test_cases)
if not success:
return {'error': error_message}
# 3. Process the returned results
details = []
correct = 0
for output in outputs:
if output.get('status') == 'OK':
output['correct'] = True
correct += 1
else:
output['correct'] = False
details.append(output)
return {
f'pass@{num_repeats}': 100 * correct / len(test_set_origin),
'details': details
}
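# --- Editor's sketch (illustrative, not part of the diff) ---------------------
# The final aggregation in score() reduces to the arithmetic below; numbers are
# made up (3 unique problems, 2 completions each, 2 problems reported 'OK').
if __name__ == '__main__':
    demo_outputs = [{'status': 'OK'}, {'status': 'OK'}, {'status': 'Failed'}]
    demo_repeats = 2
    demo_correct = sum(o['status'] == 'OK' for o in demo_outputs)
    print({f'pass@{demo_repeats}': 100 * demo_correct / len(demo_outputs)})
    # -> {'pass@2': 66.66666666666667}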

View File

@ -1,4 +1,5 @@
"""Base Evaluator."""
from collections import OrderedDict
from copy import deepcopy
from typing import Any, Dict, Iterable, List, Union
@ -77,12 +78,17 @@ class BaseEvaluator:
for metric in all_metrics:
if metric in ['predictions', 'example_abbr']:
continue
g_passk_details[metric] = 100. * np.mean(
g_passk_details[metric] = 100.0 * np.mean(
[detail[metric] for detail in details])
return g_passk_details
def evaluate(self, k: Union[int, List[int]], n: int,
original_dataset: Dataset, **score_kwargs):
def evaluate(
self,
k: Union[int, List[int]],
n: int,
original_dataset: Dataset,
**score_kwargs,
):
real_size = len(original_dataset) // n
all_details = []
all_results = []
@ -146,7 +152,7 @@ class BaseEvaluator:
if can_calculate and n > 1 and k > 1:
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
for _k in ([k] if isinstance(k, int) else k):
for _k in [k] if isinstance(k, int) else k:
for threshold in thresholds:
g_pass = compute_g_pass_at_k(n=n,
c=c,
@ -161,9 +167,31 @@ class BaseEvaluator:
if can_calculate and n > 1 and k > 1:
eval_results.update(self.reduce(eval_details))
# Store eval_details in eval_results
eval_results['details'] = eval_details
return eval_results
# Process details to flatten the predictions
for detail in eval_details:
# Extract all prediction fields and flatten them
flattened_predictions = {}
for pred in detail['predictions']:
for k, v in pred.items():
if k not in flattened_predictions:
flattened_predictions[k] = [v]
else:
flattened_predictions[k].append(v)
# Replace the predictions list with the flattened dictionary
for k, v in flattened_predictions.items():
detail[k] = v
# Remove the original predictions field
detail.pop('predictions')
return eval_results
# If there are no details, return results
return results
def score(self):
raise NotImplementedError("Method hasn't been implemented yet")
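
As a standalone illustration of the prediction-flattening step above, assuming every per-repeat prediction dict shares the same keys (names here are illustrative):

    preds = [{'answer': 'A', 'correct': True},
             {'answer': 'B', 'correct': False}]
    flattened = {}
    for pred in preds:
        for key, value in pred.items():
            flattened.setdefault(key, []).append(value)
    # flattened == {'answer': ['A', 'B'], 'correct': [True, False]}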

View File

@ -1,7 +1,3 @@
from latex2sympy2_extended import NormalizationConfig
from math_verify import (ExprExtractionConfig, LatexExtractionConfig, parse,
verify)
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import ICL_EVALUATORS
@ -10,6 +6,14 @@ from opencompass.registry import ICL_EVALUATORS
class MATHEvaluator(BaseEvaluator):
def score(self, predictions, references):
try:
from latex2sympy2_extended import NormalizationConfig
from math_verify import (ExprExtractionConfig,
LatexExtractionConfig, parse, verify)
except ImportError:
raise ImportError('Failed to import required modules. Please '
'install the necessary packages: '
'pip install math_verify latex2sympy2_extended')
self.is_num_equal(predictions, references)
@ -75,7 +79,7 @@ class MATHEvaluator(BaseEvaluator):
if __name__ == '__main__':
import sympy
from math_verify import parse
test_cases = [
# 1. Basic arithmetic operations
r'Simple fraction: \boxed{\frac{1}{2}}',
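
The same lazy-import pattern in isolation (the helper name is hypothetical), so the optional math backends are only required when the evaluator is actually used:

    def _require_math_backends():
        try:
            from latex2sympy2_extended import NormalizationConfig  # noqa: F401
            from math_verify import parse, verify  # noqa: F401
        except ImportError as exc:
            raise ImportError('Please install the optional packages: '
                              'pip install math_verify '
                              'latex2sympy2_extended') from exc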

View File

@ -256,7 +256,7 @@ class VOLCRunner(BaseRunner):
with open(config_path) as fp:
volc_cfg = yaml.safe_load(fp)
if num_gpus <= 0:
flavor = 'ml.c3i.2xlarge'
flavor = 'ml.r3i.2xlarge'
elif num_gpus == 1:
flavor = 'ml.pni2l.3xlarge'
elif num_gpus == 2:

View File

@ -171,6 +171,8 @@ class DefaultSummarizer:
default_metric = 'sum'
elif sg.get('weights', []):
default_metric = 'weighted_average'
elif sg.get('harmonic_mean', False):
default_metric = 'harmonic_mean'
else:
default_metric = 'naive_average'
@ -186,24 +188,35 @@ class DefaultSummarizer:
eval_modes.append(dataset_eval_mode.get(dataset_abbr, 'unknown'))
else:
group_metrics = list(functools.reduce(lambda a, b: a & b, [set(dataset_metrics[dataset_abbr]) for dataset_abbr in sg['subsets']]))
if need_smart_metric and len(group_metrics) > 1:
for metric in group_metrics:
for dataset_abbr in sg['subsets']:
scores.setdefault(metric, {})[dataset_abbr + '@' + metric] = parsed_results[model_abbr][dataset_abbr][metric]
eval_modes.append(dataset_eval_mode.get(sg['subsets'][0], 'unknown'))
else:
group_metrics = [default_metric]
group_metrics.append(default_metric)
for metric in group_metrics:
for dataset_abbr in sg['subsets']:
metric = dataset_metrics[dataset_abbr][0]
scores.setdefault(default_metric, {})[dataset_abbr + '@' + metric] = parsed_results[model_abbr][dataset_abbr][metric]
eval_modes.append(dataset_eval_mode.get(dataset_abbr, 'unknown'))
if metric == default_metric:
metric_default = dataset_metrics[dataset_abbr][0]
scores.setdefault(default_metric, {})[dataset_abbr + '@' + metric_default] = \
parsed_results[model_abbr][dataset_abbr][metric_default]
eval_modes.append(dataset_eval_mode.get(dataset_abbr, 'unknown'))
else:
scores.setdefault(metric, {})[dataset_abbr + '@' + metric] = \
parsed_results[model_abbr][dataset_abbr][metric]
eval_modes.append(dataset_eval_mode.get(sg['subsets'][0], 'unknown'))
result = {}
for metric in scores:
if default_metric == 'standard_deviation':
avg = sum(scores[metric].values()) / len(scores[metric])
variance = sum((scores[metric][k] - avg) ** 2 for k in scores[metric]) / len(scores[metric])
scores[metric] = result[metric] = math.sqrt(variance)
elif default_metric == 'harmonic_mean':
# Check for non-positive values that would cause issues in harmonic mean
if any(scores[metric][k] <= 0 for k in scores[metric]):
self.logger.warning(f'Non-positive values found when calculating harmonic mean for {sg["name"]}')
                    # Clamp non-positive scores to 1 so the harmonic mean stays defined
numerator = len(scores[metric])
denominator = sum(1 / max(scores[metric][k], 1) for k in scores[metric])
else:
numerator = len(scores[metric])
denominator = sum(1 / scores[metric][k] for k in scores[metric])
scores[metric] = result[metric] = numerator / denominator
else:
if sg.get('weights', []):
# check sg['weights'][k] != 0 in case of scores[metric][k] is NaN
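
A worked sketch of the harmonic-mean aggregation above, clamping non-positive scores to 1 as the code does (the function name is illustrative):

    def harmonic_mean(scores):
        n = len(scores)
        return n / sum(1 / max(s, 1) for s in scores)

    print(harmonic_mean([50.0, 80.0]))  # ~61.5, below the arithmetic mean of 65.0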

View File

@ -263,28 +263,34 @@ class OpenICLEvalTask(BaseTask):
if self.dump_details:
details = result.get('details', None)
try:
result['details'] = self.format_details(
pred_strs,
model_pred_strs,
test_set[self.output_column],
details,
model_details,
pred_dicts,
)
self.logger.warning(
f"result['details'] : {result['details']}"),
result['type'] = result['details'].pop('type', None)
if self.cal_extract_rate:
# Calculate the extraction success rate for prediction
result['extract_rate'] = self.extract_rate(result)
                # Try to format details if they are not provided by the evaluator
                if details is None:
                    self.logger.info(
                        'Details not given by evaluator, trying to format them')
try:
result['details'] = self.format_details(
pred_strs,
model_pred_strs,
test_set[self.output_column],
details,
model_details,
pred_dicts,
)
self.logger.warning(
f"result['details'] : {result['details']}"),
result['type'] = result['details'].pop('type', None)
if self.cal_extract_rate:
# Calculate the extraction success
# rate for prediction
result['extract_rate'] = self.extract_rate(result)
if 'PPL' in str(
self.dataset_cfg.infer_cfg.inferencer.type):
result['correct_bpb'], result['incorrect_bpb'] = (
self.calculate_bpb(pred_dicts))
except Exception as e:
self.logger.warning(f'Skip dumping details due to: {e}.')
if 'PPL' in str(
self.dataset_cfg.infer_cfg.inferencer.type):
result['correct_bpb'], result['incorrect_bpb'] = (
self.calculate_bpb(pred_dicts))
except Exception as e:
self.logger.warning(
f'Skip dumping details due to: {e}.')
else:
result.pop('details', None)

View File

@ -33,6 +33,12 @@ DATASETS_MAPPING = {
"hf_id": "opencompass/bbh",
"local": "./data/BBH/data",
},
# bbeh
"opencompass/bbeh": {
"ms_id": "",
"hf_id": "",
"local": "./data/bbeh/",
},
# C-Eval
"opencompass/ceval-exam": {
"ms_id": "opencompass/ceval-exam",
@ -187,6 +193,12 @@ DATASETS_MAPPING = {
"hf_id": "",
"local": "./data/mmlu_pro",
},
# MultiPL-E
"opencompass/multipl_e": {
"ms_id": "",
"hf_id": "",
"local": "./data/multipl_e",
},
# NQ
"opencompass/natural_question": {
"ms_id": "opencompass/natural_question",
@ -303,6 +315,11 @@ DATASETS_MAPPING = {
"hf_id": "",
"local": "./data/aime.jsonl",
},
"opencompass/aime2025": {
"ms_id": "",
"hf_id": "",
"local": "./data/aime2025/aime2025.jsonl",
},
"opencompass/cmo_fib": {
"ms_id": "",
"hf_id": "",
@ -616,6 +633,11 @@ DATASETS_URL = {
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu_pro.zip",
"md5": "e3200c7380f4cea5f13c768f2815fabb",
},
"multipl_e": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/multipl_e.zip",
"md5": "24462aac7a38a4a62f5c5e89eb614e20",
},
"/Longbench": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/Longbench.zip",
@ -646,11 +668,16 @@ DATASETS_URL = {
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/test_generation.zip",
"md5": "918a6ea2b1eee6f2b1314db3c21cb4c7",
},
"/aime": {
"/aime2024": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip",
"md5": "fbe2d0577fc210962a549f8cea1a00c8",
},
"/aime2025": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime2025.zip",
"md5": "aa18cd5d2e2de246c5397f5eb1e61004",
},
"/cmo": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/cmo.zip",
@ -691,6 +718,10 @@ DATASETS_URL = {
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/korbench.zip",
"md5": "9107597d137e7362eaf7d218ddef7a6d",
},
"/bbeh": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/bbeh.zip",
"md5": "43a3c2d73aee731ac68ac790bc9a358e",
},
"subjective/judgerbench": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/judgerbench.zip",

View File

@ -276,13 +276,15 @@ def change_accelerator(models, accelerator):
if model.get(item) is not None:
acc_model[item] = model[item]
elif accelerator == 'vllm':
model_kwargs = dict(tensor_parallel_size=model['run_cfg']['num_gpus'], max_model_len=model.get('max_seq_len', None))
            model_kwargs.update(model.get('model_kwargs', {}))
logger.info(f'Transforming {model["abbr"]} to {accelerator}')
acc_model = dict(
type=f'{VLLM.__module__}.{VLLM.__name__}',
abbr=model['abbr'].replace('hf', 'vllm') if '-hf' in model['abbr'] else model['abbr'] + '-vllm',
path=model['path'],
model_kwargs=dict(tensor_parallel_size=model['run_cfg']['num_gpus'], max_model_len=model.get('max_seq_len', None)),
model_kwargs=model_kwargs,
max_out_len=model['max_out_len'],
max_seq_len=model.get('max_seq_len', None),
batch_size=model['batch_size'],
@ -296,12 +298,14 @@ def change_accelerator(models, accelerator):
raise ValueError(f'Unsupported accelerator {accelerator} for model type {model["type"]}')
elif model['type'] in [HuggingFacewithChatTemplate, f'{HuggingFacewithChatTemplate.__module__}.{HuggingFacewithChatTemplate.__name__}']:
if accelerator == 'vllm':
model_kwargs = dict(tensor_parallel_size=model['run_cfg']['num_gpus'], max_model_len=model.get('max_seq_len', None))
            model_kwargs.update(model.get('model_kwargs', {}))
mod = VLLMwithChatTemplate
acc_model = dict(
type=f'{mod.__module__}.{mod.__name__}',
abbr=model['abbr'].replace('hf', 'vllm') if '-hf' in model['abbr'] else model['abbr'] + '-vllm',
path=model['path'],
model_kwargs=dict(tensor_parallel_size=model['run_cfg']['num_gpus'], max_model_len=model.get('max_seq_len', None)),
model_kwargs=model_kwargs,
max_seq_len=model.get('max_seq_len', None),
max_out_len=model['max_out_len'],
batch_size=16,
@ -309,6 +313,14 @@ def change_accelerator(models, accelerator):
stop_words=model.get('stop_words', []),
)
elif accelerator == 'lmdeploy':
            if model.get('generation_kwargs') is not None:
                logger.warning('LMDeploy uses do_sample=False by default; set do_sample=True to enable sampling mode')
                gen_config = model['generation_kwargs'].copy()
            else:
                logger.info('OpenCompass uses greedy decoding by default; set generation_kwargs to change this behavior')
                gen_config = dict(top_k=1, temperature=1e-6, top_p=0.9)
mod = TurboMindModelwithChatTemplate
acc_model = dict(
type=f'{mod.__module__}.{mod.__name__}',
@ -320,7 +332,7 @@ def change_accelerator(models, accelerator):
session_len=model.get('max_seq_len', None),
max_new_tokens=model['max_out_len']
),
gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9),
gen_config=gen_config,
max_seq_len=model.get('max_seq_len', None),
max_out_len=model['max_out_len'],
batch_size=16,
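
A hypothetical example of the gen_config selection added above: a model config that sets generation_kwargs keeps its own sampling settings, otherwise greedy decoding is used (the model dict here is illustrative):

    model = {'abbr': 'demo-hf',
             'generation_kwargs': {'do_sample': True, 'top_p': 0.95}}
    gen_config = (model['generation_kwargs'].copy()
                  if model.get('generation_kwargs') is not None
                  else dict(top_k=1, temperature=1e-6, top_p=0.9))
    # gen_config == {'do_sample': True, 'top_p': 0.95}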

View File

@ -12,7 +12,7 @@ faiss_gpu==1.7.2
# IFEval
langdetect
# TheoremQA
latex2sympy2
latex2sympy2==1.9.1
# Lawbench, leval
ltp
# Math