[Update] Add CascadeEvaluator with Data Replica (#2022)

* Update CascadeEvaluator

* Update Config
Songyang Zhang 2025-05-20 16:46:55 +08:00 committed by GitHub
parent 7a7a4517ab
commit aa2b89b6f8
43 changed files with 1471 additions and 269 deletions

View File

@ -60,7 +60,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
- **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, a flexible evaluation mechanism that allows multiple evaluators to work in sequence. This enables creating customized evaluation pipelines for complex assessment scenarios. Check out the [documentation](docs/en/advanced_guides/llm_judge.md) for more details! 🔥🔥🔥
- **\[2025.03.11\]** We have supported evaluation for `SuperGPQA` which is a great benchmark for measuring LLM knowledge ability 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for `DeepSeek-R1` series model, please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHVerifyEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model which has enhanced performance on reasoning and knowledge-intensive tasks.
- **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](examples/eval_academic_leaderboard_202412.py), which allows users to easily reproduce the official evaluation results by configuring it.
- **\[2024.11.14\]** OpenCompass now offers support for a sophisticated benchmark designed to evaluate complex reasoning skills — [MuSR](https://arxiv.org/pdf/2310.16049). Check out the [demo](examples/eval_musr.py) and give it a spin! 🔥🔥🔥
@ -246,7 +246,7 @@ Currently, OpenCompass has provided standard recommended configurations for dat
opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat
# Recommended Evaluation Config based on LLM Judge
opencompass --datasets aime2024_llm_judge_gen --models hf_internlm2_5_1_8b_chat
opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
```
If you want to use multiple GPUs to evaluate the model in data parallel, you can use `--max-num-worker`.

View File

@ -60,7 +60,7 @@
- **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, which lets multiple evaluators work in sequence so that customized evaluation pipelines can be built for more complex assessment scenarios. See the [documentation](docs/zh_cn/advanced_guides/llm_judge.md) for usage details! 🔥🔥🔥
- **\[2025.03.11\]** `SuperGPQA` is now supported, a knowledge benchmark covering 285 graduate-level disciplines. Give it a try! 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series models; see [Evaluating Reasoning Models](docs/zh_cn/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluation and `MATHEvaluator` for mathematical reasoning assessment. See the [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) docs for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluation and `MATHVerifyEvaluator` for mathematical reasoning assessment. See the [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) docs for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model, which achieves the best performance at its scale on reasoning and knowledge-intensive tasks. Give it a try!
- **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](configs/eval_academic_leaderboard_202412.py) leaderboard, so you can reproduce the official results with a simple configuration.
- **\[2024.10.14\]** The OpenAI multilingual QA dataset [MMMLU](https://huggingface.co/datasets/openai/MMMLU) is now supported. Give it a try! 🔥🔥🔥
@ -237,7 +237,7 @@ humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ce
opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat
# Recommended evaluation config based on LLM Judge
opencompass --datasets aime2024_llm_judge_gen --models hf_internlm2_5_1_8b_chat
opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
```
In addition, if you want to run model inference on multiple GPUs in data parallel, you can use the `--max-num-worker` argument.

View File

@ -303,7 +303,7 @@
category: Examination
paper: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
configpath: opencompass/configs/datasets/aime2024/aime2024_gen.py
configpath_llmjudge: opencompass/configs/datasets/aime2024/aime2024_llm_judge_gen.py
configpath_llmjudge: opencompass/configs/datasets/aime2024/aime2024_llmjudge_gen.py
- anli:
name: Adversarial NLI
category: Reasoning

View File

@ -278,7 +278,7 @@ Here's an example of how to configure the CascadeEvaluator:
```python
# Define a rule-based evaluator
rule_evaluator = dict(type=MATHEvaluator)
rule_evaluator = dict(type=MATHVerifyEvaluator)
# Define an LLM judge evaluator
llm_judge_evaluator = dict(
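# The snippet above only shows the two evaluator definitions. A minimal sketch
# of how they are typically combined into a CascadeEvaluator (field names
# follow the configs added in this PR; the cascade order described in the
# comments is an assumption based on the evaluator's name, and the judge model
# settings are omitted):
from opencompass.evaluator import CascadeEvaluator

cascade_evaluator = dict(
    type=CascadeEvaluator,
    # Rule-based verification is cheap, so it is listed first; the LLM judge
    # is assumed to handle the samples the rule-based check does not accept.
    rule_evaluator=rule_evaluator,
    llm_evaluator=llm_judge_evaluator,
    parallel=False,
)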

View File

@ -2,7 +2,7 @@
## Introduction
Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHEvaluator components.
Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHVerifyEvaluator components.
## Dataset Format
@ -61,7 +61,7 @@ math_infer_cfg = dict(
```python
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator),
evaluator=dict(type=MATHVerifyEvaluator),
)
```
@ -86,11 +86,11 @@ math_datasets = [
]
```
## MATHEvaluator
## MATHVerifyEvaluator
The MATHEvaluator is specifically designed to evaluate mathematical answers. It is developed based on the math_verify library, which provides mathematical expression parsing and verification capabilities, supporting extraction and equivalence verification for both LaTeX and general expressions.
The MATHVerifyEvaluator is specifically designed to evaluate mathematical answers. It is developed based on the math_verify library, which provides mathematical expression parsing and verification capabilities, supporting extraction and equivalence verification for both LaTeX and general expressions.
The MATHEvaluator implements:
The MATHVerifyEvaluator implements:
1. Extracts answers from both predictions and references using LaTeX extraction
2. Handles various LaTeX formats and environments
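A minimal sketch of the math_verify primitives the evaluator builds on (the expressions are invented for illustration and assume the `math_verify` package is installed):

```python
# Rough illustration of the parse/verify workflow; not the evaluator's
# exact internal code.
from math_verify import parse, verify

gold = parse('$\\frac{1}{2}$')         # reference answer in LaTeX
pred = parse('The answer is $0.5$.')   # model output containing the answer
print(verify(gold, pred))              # expected to print True (1/2 == 0.5)
```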
@ -133,7 +133,7 @@ Here's a complete example of how to set up math evaluation:
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
from opencompass.openicl.icl_evaluator.math_evaluator import MATHVerifyEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
@ -160,7 +160,7 @@ math_infer_cfg = dict(
# Evaluation configuration
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator),
evaluator=dict(type=MATHVerifyEvaluator),
)
# Dataset configuration

View File

@ -277,7 +277,7 @@ OpenCompass also provides the cascade evaluator `CascadeEvaluator`, which combines rule-based
```python
# Define a rule-based evaluator
rule_evaluator = dict(type=MATHEvaluator)
rule_evaluator = dict(type=MATHVerifyEvaluator)
# Define an LLM judge evaluator
llm_judge_evaluator = dict(

View File

@ -2,7 +2,7 @@
## Introduction
Mathematical reasoning is a key capability of large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its ability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHEvaluator components.
Mathematical reasoning is a key capability of large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its ability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHVerifyEvaluator components.
## Dataset Format
@ -61,7 +61,7 @@ math_infer_cfg = dict(
```python
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator),
evaluator=dict(type=MATHVerifyEvaluator),
)
```
@ -86,11 +86,11 @@ math_datasets = [
]
```
## MATHEvaluator
## MATHVerifyEvaluator
MATHEvaluator is an evaluator designed specifically for grading mathematical answers. It is built on the math_verify library, which provides mathematical expression parsing and verification, supporting extraction and equivalence checking for both LaTeX and general expressions.
MATHVerifyEvaluator is an evaluator designed specifically for grading mathematical answers. It is built on the math_verify library, which provides mathematical expression parsing and verification, supporting extraction and equivalence checking for both LaTeX and general expressions.
MATHEvaluator provides the following features:
MATHVerifyEvaluator provides the following features:
1. Extracts answers from both predictions and references using a LaTeX extractor
2. Handles various LaTeX formats and environments
@ -133,7 +133,7 @@ MATHEvaluator provides the following features:
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
from opencompass.evaluator import MATHVerifyEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
@ -160,7 +160,7 @@ math_infer_cfg = dict(
# Evaluation configuration
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator),
evaluator=dict(type=MATHVerifyEvaluator),
)
# Dataset configuration

View File

@ -7,9 +7,12 @@ from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator, CascadeEvaluator
from opencompass.evaluator import (
GenericLLMEvaluator,
CascadeEvaluator,
MATHVerifyEvaluator,
)
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.openicl.icl_evaluator import MATHEvaluator
from opencompass.datasets import (
MATHDataset,
math_postprocess_v2,
@ -94,7 +97,7 @@ llm_judge_evaluator = dict(
judge_cfg=dict(),
)
rule_evaluator =dict(type=MATHEvaluator)
rule_evaluator =dict(type=MATHVerifyEvaluator)
cascade_evaluator = dict(type=CascadeEvaluator,
llm_evaluator=llm_judge_evaluator,
rule_evaluator=rule_evaluator,

examples/eval_qwen3.py (new file, 142 lines added)
View File

@ -0,0 +1,142 @@
import os.path as osp
from opencompass.models import OpenAISDK
from mmengine.config import read_base
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
from opencompass.runners import LocalRunner
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
with read_base():
from opencompass.configs.datasets.aime2024.aime2024_cascade_eval_gen_5e9f4f import aime2024_datasets
from opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f import aime2025_datasets
from opencompass.configs.datasets.math.math_500_cascade_eval_gen_6ff468 import math_datasets
#######################################################################
# PART 0 Meta Info #
#######################################################################
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
)
judge_cfg = dict(
abbr='qwen2-5-32B-Instruct',
type=OpenAISDK,
path='Qwen/Qwen2.5-32B-Instruct',
key='sk-1234',
openai_api_base=[
'http://x.x.x.x:4000/v1',
],
meta_template=api_meta_template,
query_per_second=8,
batch_size=256,
temperature=0.001,
# max_completion_tokens=32768,
tokenizer_path='gpt-4o-2024-05-13',
# verbose=True,
max_out_len=16384,
max_seq_len=32768,
# max_seq_len=49152,
mode='mid',
retry=10
)
#######################################################################
# PART 1 Datasets List #
#######################################################################
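# Each entry below pairs a dataset list with its replica count: the loop
# writes the count into each dataset's `n` field, and the summarizer reports
# the averaged result (e.g. 'accuracy (32 runs average)').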
repeated_info = [
(math_datasets, 4),
(aime2024_datasets, 32),
(aime2025_datasets, 32),
]
for datasets_, num in repeated_info:
for dataset_ in datasets_:
dataset_['n'] = num
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')),
[],
)
for item in datasets:
item['infer_cfg']['inferencer']['max_out_len'] = 32768
try:
# Inject the shared judge model config, whether the evaluator is a plain
# LLM judge or a CascadeEvaluator wrapping one in `llm_evaluator`.
if 'judge_cfg' in item['eval_cfg']['evaluator']:
item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
elif 'judge_cfg' in item['eval_cfg']['evaluator']['llm_evaluator']:
item['eval_cfg']['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
except KeyError:
pass
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
summarizer = dict(
dataset_abbrs=[
'MATH',
['math_prm800k_500', 'accuracy (4 runs average)'],
['aime2024', 'accuracy (32 runs average)'],
['aime2025', 'accuracy (32 runs average)'],
['livemathbench_hard', 'naive_average'],
['OlympiadBenchMath', 'accuracy'],
['olymmath', 'naive_average'],
],
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []
),
)
#######################################################################
# PART 3 Models List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
models += [
dict(
abbr='Qwen_Qwen3-235B-A22B',
type=OpenAISDK,
path='Qwen/Qwen3-235B-A22B',
key='sk-admin',
openai_api_base=[
'http://106.15.231.215:40007/v1/',
],
meta_template=dict(
# begin=dict(role='SYSTEM', api_role='SYSTEM', prompt=''),
round=[
dict(role='HUMAN', api_role='HUMAN'),
# XXX: all system roles are mapped to human in purpose
dict(role='BOT', api_role='BOT', generate=True),
]
),
query_per_second=16,
batch_size=128,
# batch_size=1,
temperature=0.6,
# max_completion_tokens=32768,
tokenizer_path='gpt-4',
# verbose=True,
max_out_len=32768,
max_seq_len=32768,
pred_postprocessor=dict(type=extract_non_reasoning_content)
),
]
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
eval = dict(
partitioner=dict(type=NaivePartitioner, n=8),
runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)
base_exp_dir = 'outputs/qwen3_reasoning'
work_dir = osp.join(base_exp_dir, 'chat_objective')

View File

@ -12,8 +12,8 @@ from mmengine.config import Config, DictAction
from opencompass.registry import PARTITIONERS, RUNNERS, build_from_cfg
from opencompass.runners import SlurmRunner
from opencompass.summarizers import DefaultSummarizer
from opencompass.utils import (LarkReporter, get_logger, read_from_station,
save_to_station)
from opencompass.utils import (LarkReporter, get_logger, pretty_print_config,
read_from_station, save_to_station)
from opencompass.utils.run import (fill_eval_cfg, fill_infer_cfg,
get_config_from_arg)
@ -94,6 +94,11 @@ def parse_args():
help='Use the custom config directory instead of config/ to '
'search the configs for datasets, models and summarizers',
type=str)
parser.add_argument(
'--config-verbose',
default=False,
action='store_true',
help='Whether to print the config in verbose mode.')
parser.add_argument('-l',
'--lark',
help='Report the running status to lark bot',
@ -131,7 +136,7 @@ def parse_args():
'correctness of each sample, bpb, etc.',
action='store_true',
)
# for the results persistence
parser.add_argument('-sp',
'--station-path',
help='Path to your results station.',
@ -150,7 +155,12 @@ def parse_args():
'data station.',
action='store_true',
)
# for evaluation with multiple runs
parser.add_argument('--dataset-num-runs',
help='How many runs for one dataset',
type=int,
default=1,
)
# set srun args
slurm_parser = parser.add_argument_group('slurm_args')
@ -299,6 +309,11 @@ def main():
content = f'{getpass.getuser()}\'s task has been launched!'
LarkReporter(cfg['lark_bot_url']).post(content)
# print config if specified --config-verbose
if args.config_verbose:
pretty_print_config(cfg)
# infer
if args.mode in ['all', 'infer']:
# When user have specified --slurm or --dlc, or have not set

View File

@ -0,0 +1,109 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets import OlymMATHDataset
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator
)
# ----------------------------- Detailed Config -----------------------------
math_reader_cfg = dict(input_columns=['problem'], output_column='answer', train_split='test')
math_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{problem}\nRemember to put your final answer within \\boxed{}.'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
sub_sets = ['en-hard', 'zh-hard', 'en-easy', 'zh-easy']
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
# Evaluation configuration
olymmath_datasets = []
for sub_set in sub_sets:
math_eval_cfg = dict(
evaluator=dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
type=OlymMATHDataset,
path='RUC-AIBOX/OlymMATH',
reader_cfg=math_reader_cfg,
subset=sub_set,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False,
),
)
olymmath_datasets.append(
dict(
type=OlymMATHDataset,
abbr=f'olymmath_{sub_set}',
path='RUC-AIBOX/OlymMATH',
reader_cfg=math_reader_cfg,
infer_cfg=math_infer_cfg,
eval_cfg=math_eval_cfg,
subset=sub_set,
n=1
)
)

View File

@ -0,0 +1,114 @@
from mmengine.config import read_base
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import OlympiadBenchDataset, OlympiadBenchEvaluator, olympiadbench_postprocess_v2
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.evaluator import (
GenericLLMEvaluator,
CascadeEvaluator,
MATHVerifyEvaluator
)
from opencompass.datasets import generic_llmjudge_postprocess
with read_base():
from .OlympiadBench_categories import categories
# Create prompter instance for problems
olympiadbench_prompter_cfg = dict(
type='OlympiadBenchPrompter'
)
olympiadbench_reader_cfg = dict(
input_columns=[
'problem', 'language', 'subject', 'question_type',
'answer_type', 'is_multiple_answer', 'unit', 'questions'
],
output_column='solution'
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{solution}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
olympiadbench_datasets = []
for _name in categories:
olympiadbench_infer_cfg = dict(
prompt_template=dict(
type='OlympiadBenchTemplate'
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
olympiadbench_eval_cfg = dict(
evaluator=dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
type=OlympiadBenchDataset,
path='opencompass/OlympiadBench',
name=_name,
reader_cfg=olympiadbench_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False
)
)
olympiadbench_datasets.append(
dict(
type=OlympiadBenchDataset,
abbr=f'OlympiadBench_{_name}',
path='opencompass/OlympiadBench',
name=_name,
reader_cfg=olympiadbench_reader_cfg,
infer_cfg=olympiadbench_infer_cfg,
eval_cfg=olympiadbench_eval_cfg,
n=1,
)
)

View File

@ -1,28 +1,44 @@
"""
Summary: A config for AIME-2024 Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- MATHVerifyEvaluator
- GenericLLMEvaluator
Repeat: 1
Available Models:
- Instruct/Chat Models
"""
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import Aime2024Dataset, MATHEvaluator, math_postprocess_v2
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.utils import xml_tag_postprocessor
aime2024_reader_cfg = dict(
input_columns=['question'],
output_column='answer'
from opencompass.datasets import Aime2024Dataset
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator
)
aime2024_reader_cfg = dict(input_columns=['question'], output_column='answer')
aime2024_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{question}\nRemember to put your final answer within \\boxed{}.'),
dict(
role='HUMAN',
prompt='{question}\nRemember to put your final answer within \\boxed{}.',
),
],
)
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=2048)
inferencer=dict(type=GenInferencer),
)
@ -51,24 +67,27 @@ GRADER_TEMPLATE = """
Judging the correctness of candidates' answers:
""".strip()
aime2024_eval_cfg = dict(
evaluator=dict(
cascade_evaluator = dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator= dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=Aime2024Dataset,
@ -77,9 +96,13 @@ aime2024_eval_cfg = dict(
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
pred_postprocessor=dict(type=xml_tag_postprocessor, tag='<conclude>'),
),
pred_role='BOT',
parallel=False,
)
aime2024_eval_cfg = dict(
evaluator=cascade_evaluator,
)
aime2024_datasets = [
@ -90,6 +113,6 @@ aime2024_datasets = [
reader_cfg=aime2024_reader_cfg,
infer_cfg=aime2024_infer_cfg,
eval_cfg=aime2024_eval_cfg,
mode='singlescore',
n=1,# Evaluate the dataset with n times
)
]
]

View File

@ -0,0 +1,115 @@
"""
Summary: A config for AIME-2025 Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- MATHVerifyEvaluator
- GenericLLMEvaluator
Repeat: 1
Available Models:
- Instruct/Chat Models
"""
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CustomDataset
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator
)
aime2025_reader_cfg = dict(input_columns=['question'], output_column='answer')
aime2025_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{question}\nRemember to put your final answer within \\boxed{}.',
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
cascade_evaluator = dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CustomDataset,
path='opencompass/aime2025',
reader_cfg=aime2025_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False,
)
aime2025_eval_cfg = dict(
evaluator=cascade_evaluator,
)
aime2025_datasets = [
dict(
type=CustomDataset,
abbr='aime2025',
path='opencompass/aime2025',
reader_cfg=aime2025_reader_cfg,
infer_cfg=aime2025_infer_cfg,
eval_cfg=aime2025_eval_cfg,
n=1,
)
]

View File

@ -0,0 +1,118 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import GPQADataset, GPQA_Simple_Eval_postprocess
from opencompass.evaluator import GenericLLMEvaluator, CascadeEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.utils.text_postprocessors import match_answer_pattern
# openai_simple_eval prompt
align_prompt = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of ABCD.
{question}
A) {A}
B) {B}
C) {C}
D) {D}
""".strip()
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {question}\n A) {A}\n B) {B}\n C) {C}\n D) {D}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
gpqa_reader_cfg = dict(
input_columns=['question', 'A', 'B', 'C', 'D'],
output_column='answer')
gpqa_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=align_prompt),
], )),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
gpqa_datasets = []
gpqa_subsets = {
# 'extended': 'gpqa_extended.csv',
# 'main': 'gpqa_main.csv',
'diamond': 'gpqa_diamond.csv'
}
for split in list(gpqa_subsets.keys()):
gpqa_eval_cfg = dict(
evaluator=dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=AccEvaluator,
pred_postprocessor=dict(type=match_answer_pattern, answer_pattern=r'(?i)ANSWER\s*:\s*([A-D])'),
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
type=GPQADataset,
path='./data/gpqa/',
name=gpqa_subsets[split],
reader_cfg=gpqa_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False,
),
)
gpqa_datasets.append(
dict(
abbr='GPQA_' + split,
type=GPQADataset,
path='./data/gpqa/',
name=gpqa_subsets[split],
reader_cfg=gpqa_reader_cfg,
infer_cfg=gpqa_infer_cfg,
eval_cfg=gpqa_eval_cfg,
mode='singlescore',
)
)

View File

@ -1,17 +1,28 @@
"""
Summary: A config for KoR-Bench Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- korbenchEvaluator
- GenericLLMEvaluator
Repeat: 1
Available Models:
- Instruct/Chat Models
"""
from opencompass.datasets.korbench.korbench import korbenchDataset, korbenchEvaluator
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.evaluator import GenericLLMEvaluator, CascadeEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.utils import xml_tag_postprocessor
categories = ['cipher', 'counterfactual', 'logic', 'operation', 'puzzle']
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
@ -30,7 +41,7 @@ GRADER_TEMPLATE = """
<Original Question Begin>: \n{prompt}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
@ -50,7 +61,7 @@ for category in categories:
round=[
dict(
role='HUMAN',
prompt='{prompt}' # f-string
prompt='{prompt}' # f-string
)
]
)
@ -66,41 +77,46 @@ for category in categories:
infer_cfg = dict(
prompt_template=prompt_template,
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=1024),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
type=CascadeEvaluator,
rule_evaluator=dict(
type=korbenchEvaluator,
),
dataset_cfg=dict(
type=korbenchDataset,
path='opencompass/korbench',
prompt_mode='0_shot',
category=category,
reader_cfg=reader_cfg,
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt=GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
type=korbenchDataset,
path='opencompass/korbench',
prompt_mode='0_shot',
category=category,
reader_cfg=reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
pred_postprocessor=dict(type=xml_tag_postprocessor, tag='<conclude>'),
),
pred_role='BOT',
parallel=False,
)
)
# Dataset
@ -113,7 +129,7 @@ for category in categories:
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=eval_cfg,
mode='singlescore',
n=1,
)
korbench_0shot_single_datasets.append(korbench_dataset)
korbench_0shot_single_datasets.append(korbench_dataset)

View File

@ -0,0 +1,120 @@
"""
Summary: A config for LiveMathBench-Hard-202412 Dataset Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- MATHVerifyEvaluator
- GenericLLMEvaluator
Repeat: 32
Available Models:
- Instruct/Chat Models
"""
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import CustomDataset
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator,
)
livemathbench_reader_cfg = dict(input_columns=['question'], output_column='answer')
# Inference configuration
livemathbench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{question}\nRemember to put your final answer within \\boxed{}.',
),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Template for the LLM judge
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
splits = ['hard_cn', 'hard_en']
# Dataset configuration
livemathbench_datasets = [
dict(
type=CustomDataset,
abbr=f'livemathbench_hard_custom_{split}',
path='data/LiveMathBench',
local_mode=True,
file_name=f'202412/{split}.jsonl',
reader_cfg=livemathbench_reader_cfg,
infer_cfg=livemathbench_infer_cfg,
eval_cfg=dict(
# Evaluation configuration using LLM as judge
evaluator=dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CustomDataset,
path='data/LiveMathBench',
local_mode=True,
file_name=f'202412/{split}.jsonl',
reader_cfg=livemathbench_reader_cfg,
),
judge_cfg={},
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False
),
),
n=1, # repeat n times
) for split in splits
]

View File

@ -4,7 +4,6 @@ from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import LiveReasonBenchDataset, livereasonbench_postprocess
from opencompass.utils import xml_tag_postprocessor
GRADER_TEMPLATE = """
@ -97,7 +96,7 @@ livereasonbench_infer_cfg = dict(
],
)),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=16384))
inferencer=dict(type=GenInferencer))
livereasonbench_eval_cfg = dict(
evaluator=dict(
@ -122,23 +121,22 @@ livereasonbench_eval_cfg = dict(
type=LiveReasonBenchDataset,
path='opencompass/LiveReasonBench',
reader_cfg=livereasonbench_reader_cfg,
version='livereasonbench-20250428',
),
judge_cfg=dict(),
dict_postprocessor=dict(type=livereasonbench_postprocess),
pred_postprocessor=dict(type=xml_tag_postprocessor, tag='<conclude>'),
),
pred_role='BOT',
)
livereasonbench_datasets = [
dict(
abbr='LiveReasonBench-20241202',
abbr='LiveReasonBench-20250428',
type=LiveReasonBenchDataset,
path='opencompass/LiveReasonBench',
reader_cfg=livereasonbench_reader_cfg,
infer_cfg=livereasonbench_infer_cfg,
eval_cfg=livereasonbench_eval_cfg,
version='livereasonbench-20241202',
mode='singlescore',
version='livereasonbench-20250428',
n=1
)
]

View File

@ -0,0 +1,117 @@
"""
Summary: A config for MATH-500 (prm800k_500) Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- MATHVerifyEvaluator
- GenericLLMEvaluator
Available Models:
- Instruct/Chat Models
"""
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets import MATHDataset
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator
)
# ----------------------------- Detailed Config -----------------------------
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')
math_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{problem}\nRemember to put your final answer within \\boxed{}.'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{solution}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
cascade_evaluator = dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=MATHDataset,
path='opencompass/math',
file_name = 'test_prm800k_500.json',
reader_cfg=math_reader_cfg,
n=4,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False,
)
math_datasets = [
dict(
type=MATHDataset,
abbr=f'math_prm800k_500',
path='opencompass/math',
file_name = 'test_prm800k_500.json',
reader_cfg=math_reader_cfg,
infer_cfg=math_infer_cfg,
eval_cfg=dict(
evaluator=cascade_evaluator,
),
n=1,
)
]

View File

@ -2,7 +2,7 @@ from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CustomDataset
from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
from opencompass.evaluator import MATHVerifyEvaluator
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')
@ -24,7 +24,7 @@ math_infer_cfg = dict(
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator),
evaluator=dict(type=MATHVerifyEvaluator),
)
math_datasets = [

View File

@ -2,7 +2,7 @@ from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MATHDataset
from opencompass.openicl.icl_evaluator import MATHEvaluator
from opencompass.evaluator import MATHVerifyEvaluator
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')
@ -24,7 +24,7 @@ math_infer_cfg = dict(
inferencer=dict(type=GenInferencer))
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator)
evaluator=dict(type=MATHVerifyEvaluator)
)
math_datasets = [

View File

@ -1,7 +1,7 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import MATHEvaluator
from opencompass.evaluator import MATHVerifyEvaluator
from opencompass.datasets import (
MATHDataset,
math_postprocess_v2,
@ -28,7 +28,7 @@ math_infer_cfg = dict(
# postprocess v2
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator)
evaluator=dict(type=MATHVerifyEvaluator)
)
math_datasets = [

View File

@ -0,0 +1,127 @@
"""
Setting: 0-shot No-CoT
Evaluator: CascadeEvaluator (AccEvaluator + GenericLLMEvaluator)
"""
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import MMLUDataset
from opencompass.utils.text_postprocessors import match_answer_pattern
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
)
with read_base():
# from .....configs.datasets.mmlu.mmlu_all_sets import mmlu_all_sets
from .mmlu_stem_sets import mmlu_all_sets
# None of the MMLU datasets on Hugging Face are parsed correctly, so we use our own dataset reader
# Please download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar
QUERY_TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of ABCD.
{input}
A) {A}
B) {B}
C) {C}
D) {D}
""".strip()
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {input}\n A) {A}\n B) {B}\n C) {C}\n D) {D}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
mmlu_reader_cfg = dict(
input_columns=['input', 'A', 'B', 'C', 'D'],
output_column='target',
train_split='dev')
mmlu_datasets = []
for name in mmlu_all_sets:
mmlu_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=QUERY_TEMPLATE),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
mmlu_eval_cfg = dict(
evaluator=dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=AccEvaluator,
pred_postprocessor=dict(type=match_answer_pattern, answer_pattern=r'(?i)ANSWER\s*:\s*([A-D])'),
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt=GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
abbr=f'lukaemon_mmlu_{name}',
type=MMLUDataset,
path='opencompass/mmlu',
name=name,
reader_cfg=mmlu_reader_cfg,
),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
judge_cfg=dict(),
),
parallel=False
),
)
mmlu_datasets.append(
dict(
abbr=f'lukaemon_mmlu_{name}',
type=MMLUDataset,
path='opencompass/mmlu',
name=name,
reader_cfg=mmlu_reader_cfg,
infer_cfg=mmlu_infer_cfg,
eval_cfg=mmlu_eval_cfg,
mode='singlescore',
))
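For orientation, a minimal sketch of how the cascade-evaluated MMLU sets above could be plugged into a full run; the model entry and both import paths are placeholders rather than paths introduced by this commit, and the judge model is picked up from the environment because judge_cfg is left empty.

from mmengine.config import read_base

with read_base():
    # Placeholder paths: point these at the cascade config above and at any
    # chat model config available in your installation.
    from opencompass.configs.datasets.mmlu.mmlu_cascade_llmjudge_gen import mmlu_datasets
    from opencompass.configs.models.hf_internlm.hf_internlm2_5_1_8b_chat import models

datasets = mmlu_datasets
# With judge_cfg left empty, GenericLLMEvaluator falls back to the
# OC_JUDGE_MODEL / OC_JUDGE_API_KEY / OC_JUDGE_API_BASE environment variables.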

View File

@ -1,30 +1,46 @@
"""
Summary: A config for OmniMath Dataset Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- MATHVerifyEvaluator
- GenericLLMEvaluator
Repeat: 1
Available Models:
- Instruct/Chat Models
"""
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import Aime2024Dataset, MATHEvaluator, math_postprocess_v2
from opencompass.openicl.icl_evaluator import LMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets.omni_math import OmniMathDataset
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator,
)
aime2024_reader_cfg = dict(
input_columns=['question'],
omnimath_reader_cfg = dict(
input_columns=['problem'],
output_column='answer'
)
aime2024_infer_cfg = dict(
omnimath_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{question}\nRemember to put your final answer within \\boxed{}.'),
],
dict(role='HUMAN', prompt='please answer the following mathematical question, put your final answer in \\boxed{}.\n\n{problem}'),
]
)
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=2048)
inferencer=dict(type=GenInferencer)
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
@ -43,16 +59,20 @@ GRADER_TEMPLATE = """
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
aime2024_eval_cfg = dict(
evaluator=dict(
type=LMEvaluator,
cascade_evaluator = dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
@ -69,19 +89,27 @@ aime2024_eval_cfg = dict(
),
]),
),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
dataset_cfg=dict(
type=OmniMathDataset,
reader_cfg=omnimath_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
parallel=False,
)
aime2024_datasets = [
omnimath_eval_cfg = dict(
evaluator=cascade_evaluator,
)
omnimath_datasets = [
dict(
abbr='aime2024',
type=Aime2024Dataset,
path='opencompass/aime2024',
reader_cfg=aime2024_reader_cfg,
infer_cfg=aime2024_infer_cfg,
eval_cfg=aime2024_eval_cfg,
mode='singlescore',
type=OmniMathDataset,
abbr='OmniMath',
reader_cfg=omnimath_reader_cfg,
infer_cfg=omnimath_infer_cfg,
eval_cfg=omnimath_eval_cfg,
n=1,
)
]
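Because this change set wires dataset replicas through the evaluator stack, the entry above can request repeated evaluation simply by raising n (and k for pass@k); a hedged variant with illustrative values:

omnimath_repeated_datasets = [
    dict(
        type=OmniMathDataset,
        abbr='OmniMath',
        reader_cfg=omnimath_reader_cfg,
        infer_cfg=omnimath_infer_cfg,
        eval_cfg=omnimath_eval_cfg,
        # Evaluate 4 replicas; BaseEvaluator averages metrics across them and,
        # where per-sample flags such as cascade_correct exist, reports pass@k.
        n=4,
        k=4,
    )
]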

View File

@ -1,18 +1,19 @@
from mmengine.config import read_base
with read_base():
from .groups.agieval import agieval_summary_groups
from .groups.mmlu import mmlu_summary_groups
from .groups.cmmlu import cmmlu_summary_groups
from .groups.ceval import ceval_summary_groups
from .groups.bbh import bbh_summary_groups
from .groups.GaokaoBench import GaokaoBench_summary_groups
from .groups.flores import flores_summary_groups
from .groups.tydiqa import tydiqa_summary_groups
from .groups.xiezhi import xiezhi_summary_groups
from .groups.scibench import scibench_summary_groups
from .groups.mgsm import mgsm_summary_groups
from .groups.longbench import longbench_summary_groups
# with read_base():
# pass
# from .groups.agieval import agieval_summary_groups
# from .groups.mmlu import mmlu_summary_groups
# from .groups.cmmlu import cmmlu_summary_groups
# from .groups.ceval import ceval_summary_groups
# from .groups.bbh import bbh_summary_groups
# from .groups.GaokaoBench import GaokaoBench_summary_groups
# from .groups.flores import flores_summary_groups
# from .groups.tydiqa import tydiqa_summary_groups
# from .groups.xiezhi import xiezhi_summary_groups
# from .groups.scibench import scibench_summary_groups
# from .groups.mgsm import mgsm_summary_groups
# from .groups.longbench import longbench_summary_groups
summarizer = dict(
summary_groups=sum([v for k, v in locals().items() if k.endswith('_summary_groups')], []),

View File

@ -3,6 +3,9 @@ from typing import Dict, List, Optional, Union
from datasets import Dataset, DatasetDict, concatenate_datasets
from opencompass.openicl import DatasetReader
from opencompass.utils import get_logger
logger = get_logger()
class BaseDataset:

View File

@ -173,44 +173,76 @@ class korbenchEvaluator(BaseEvaluator):
def __init__(self):
super().__init__()
def score(self, predictions, references, test_set):
"""Evaluate predictions for a single prompt_mode in KOR-Bench."""
if not test_set:
raise ValueError('Test set is empty.')
def sample_score(self, prediction, reference, test_item=None):
"""Evaluate a single sample.
prompt_mode = test_set[0][
'prompt_mode'] # Determine the prompt_mode from the first entry
data = {}
Args:
prediction: The model's prediction
reference: The reference answer
test_item: Additional information about the test sample
# Organize data for the given prompt_mode
for i in range(len(predictions)):
entry = {
'prediction': predictions[i],
'gold': references[i],
'rule_id': test_set[i].get('rule_id', None),
'category': test_set[i].get('category', None),
'rule_list': test_set[i].get('rule_list', None),
'question_list': test_set[i].get('question_list', None),
'base_path': test_set[i].get('base_path', None),
}
data[i] = entry
Returns:
Dict: A dictionary containing evaluation results
"""
if test_item is None:
raise ValueError('Test item is required.')
if not data:
raise ValueError(f"No data found for prompt_mode '{prompt_mode}'")
prompt_mode = test_item.get('prompt_mode')
# Evaluate based on the prompt_mode
# Build data for a single sample
entry = {
'prediction': prediction,
'gold': reference,
'rule_id': test_item.get('rule_id', None),
'category': test_item.get('category', None),
'rule_list': test_item.get('rule_list', None),
'question_list': test_item.get('question_list', None),
'base_path': test_item.get('base_path', None),
}
# Evaluate the single sample
data = {0: entry}
# Evaluate based on different prompt_mode
if prompt_mode == '0_shot':
evaluation_results = evaluate_responses(data, '0_shot')
elif prompt_mode == '3_shot':
evaluation_results = evaluate_responses(data, '3_shot')
elif prompt_mode in ['Multi-Q', 'Multi-R', 'Multi-RQ', 'mixed']:
evaluation_results = evaluate_responses(data, 'mixed',
test_set[0]['base_path'])
test_item.get('base_path'))
else:
raise ValueError(f'Unsupported prompt_mode: {prompt_mode}')
# Calculate accuracy
correct_count = sum(res['is_correct'] for res in evaluation_results)
accuracy = (correct_count / len(evaluation_results)) * 100
return {
'is_correct': False,
'pred': prediction,
'answer': reference
}
# Return scores
return {'accuracy': accuracy}
# Return evaluation results
result = evaluation_results[0]
result['correct'] = result['is_correct']
result.update({'pred': prediction, 'answer': reference})
return result
def score(self, predictions, references, test_set):
"""Evaluate each sample using sample_score."""
if not test_set:
raise ValueError('Test set is empty.')
details = []
correct_count = 0
# Call sample_score for each sample
for i in range(len(predictions)):
result = self.sample_score(predictions[i], references[i],
test_set[i])
details.append(result)
if result.get('is_correct', False):
correct_count += 1
# Calculate accuracy
accuracy = (correct_count /
len(predictions)) * 100 if predictions else 0
# Return evaluation results
return {'accuracy': accuracy, 'details': details}
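The refactor above splits per-sample logic (sample_score) from aggregation (score), which is what lets CascadeEvaluator call evaluators one sample at a time. A minimal sketch of the same pattern for a custom evaluator, assuming exact-match scoring and that BaseEvaluator is importable as shown:

from opencompass.openicl.icl_evaluator import BaseEvaluator


class ExactMatchEvaluator(BaseEvaluator):

    def sample_score(self, prediction, reference, test_item=None):
        # Score one sample; 'is_correct' mirrors the field used above.
        is_correct = prediction.strip() == reference.strip()
        return {'is_correct': is_correct, 'pred': prediction, 'answer': reference}

    def score(self, predictions, references, test_set=None):
        details = [
            self.sample_score(p, r, test_set[i] if test_set else None)
            for i, (p, r) in enumerate(zip(predictions, references))
        ]
        correct = sum(d['is_correct'] for d in details)
        accuracy = (correct / len(predictions)) * 100 if predictions else 0
        return {'accuracy': accuracy, 'details': details}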

View File

@ -204,7 +204,11 @@ def math_postprocess_v2(text: str) -> str:
@ICL_EVALUATORS.register_module()
class MATHEvaluator(BaseEvaluator):
def __init__(self, version='v1'):
def __init__(self,
version='v1',
pred_postprocessor=None):  # accept pred_postprocessor to forward to the parent __init__
super().__init__(
pred_postprocessor=pred_postprocessor)  # call the parent __init__
assert version in ['v1', 'v2']
self.version = version

View File

@ -280,7 +280,11 @@ class MusrDataset(BaseDataset):
@ICL_EVALUATORS.register_module()
class MusrEvaluator(BaseEvaluator):
def __init__(self, answer_index_modifier=1, self_consistency_n=1):
def __init__(self,
answer_index_modifier=1,
self_consistency_n=1,
pred_postprocessor=None):
super().__init__(pred_postprocessor=pred_postprocessor)
self.answer_index_modifier = answer_index_modifier
self.self_consistency_n = self_consistency_n

View File

@ -76,7 +76,6 @@ class ReviewEvaluator:
pred_data = data_sample.pred
if pred_data is not None:
# import pdb; pdb.set_trace()
metrics_result['review_quality'] = 1.0 if pred_data == \
data_sample.gt else 0.0
metrics_result['parse_rate'] = 1.0

View File

@ -1,2 +1,3 @@
from .cascade_evaluator import CascadeEvaluator # noqa
from .generic_llm_evaluator import GenericLLMEvaluator # noqa
from .math_evaluator import MATHVerifyEvaluator # noqa

View File

@ -34,7 +34,8 @@ class CascadeEvaluator(BaseEvaluator):
sample_score_fn: Optional[Callable] = None,
parallel: bool = True,
) -> None:
self.logger = get_logger()
super().__init__()
self.logger = get_logger(__name__)
# Initialize the LLM evaluator
llm_evaluator_type = llm_evaluator.pop('type')
@ -58,7 +59,10 @@ class CascadeEvaluator(BaseEvaluator):
raise ValueError(
'Either rule_evaluator or sample_score_fn must be provided')
def sample_score(self, prediction: str, reference: str) -> Dict[str, Any]:
def sample_score(self,
prediction: str,
reference: str,
test_set=None) -> Dict[str, Any]:
"""Score a single sample using sample_score_fn or rule_evaluator.
Args:
@ -70,7 +74,7 @@ class CascadeEvaluator(BaseEvaluator):
"""
if self.sample_score_fn:
# Use user-provided function to evaluate a single sample
result = self.sample_score_fn(prediction, reference)
result = self.sample_score_fn(prediction, reference, test_set)
if not isinstance(result, dict):
# Ensure result is a dictionary with at least 'correct' field
result = {
@ -82,7 +86,8 @@ class CascadeEvaluator(BaseEvaluator):
else:
# Use rule_evaluator to evaluate a single sample by calling
# the score method with single-element lists
result = self.rule_evaluator.score([prediction], [reference])
result = self.rule_evaluator.score([prediction], [reference],
[test_set])
if 'details' in result and len(result['details']) > 0:
return result['details'][0]
else:
@ -137,7 +142,14 @@ class CascadeEvaluator(BaseEvaluator):
failed_indices = []
for i, (pred, ref) in enumerate(zip(predictions, references)):
result = self.sample_score(pred, ref)
if test_set is not None:
test_item = test_set[i]
else:
test_item = None
# Apply prediction postprocessing for each sample
[pred] = self.rule_evaluator.pred_postprocess([pred])
result = self.sample_score(pred, ref, test_item)
result['evaluation_method'] = 'rule'
details.append({'rule_evaluation': result})
@ -181,8 +193,11 @@ class CascadeEvaluator(BaseEvaluator):
original_out_dir = getattr(self.llm_evaluator, '_out_dir', None)
self.llm_evaluator._out_dir = f'{self._out_dir}_llm_judge'
# Append the dataset replica index to the LLM results path
llm_results_path = f'{self.llm_evaluator._out_dir}_replica{self.dataset_replica_idx}.json' # noqa
self.logger.info(f'LLM evaluation results will be saved at '
f'{llm_results_path}')
# Check if results already exist to avoid re-evaluation
llm_results_path = f'{self.llm_evaluator._out_dir}.json'
if os.path.exists(llm_results_path):
self.logger.info(
f'Loading existing LLM evaluation results from '
@ -212,7 +227,15 @@ class CascadeEvaluator(BaseEvaluator):
# Use GenericLLMEvaluator to evaluate samples
# unset dataset_cfg for GenericLLMEvaluator to
# directly use test_set
# self.llm_evaluator.output_path = llm_results_path
self.llm_evaluator._dataset_replica_idx = \
self._dataset_replica_idx
self.llm_evaluator.dataset_cfg = None
# Apply prediction postprocessing for the LLM evaluator
failed_predictions = self.llm_evaluator.pred_postprocess(
failed_predictions)
llm_results = self.llm_evaluator.score(
predictions=failed_predictions,
references=failed_references,
@ -235,6 +258,9 @@ class CascadeEvaluator(BaseEvaluator):
# Update the details for samples that were evaluated by LLM
for i, llm_detail in enumerate(llm_details.values()):
# Add dataset replica index to LLM evaluation result
llm_detail['dataset_replica_idx'] = self.dataset_replica_idx
original_index = failed_indices[i]
# Store original rule-based evaluation result
rule_result = details[original_index].copy()
@ -283,6 +309,16 @@ class CascadeEvaluator(BaseEvaluator):
f'LLM evaluation: {llm_correct}/{llm_evaluated} '
f'correct ({llm_accuracy:.2f}%)')
# Append cascade correctness flag to each sample
for item in details:
_rule_correct = item['rule_evaluation'].get('correct', False)
if 'llm_evaluation' in item:
_llm_correct = item['llm_evaluation'].get(
'llm_correct', False)
else:
_llm_correct = False
item['cascade_correct'] = _rule_correct or _llm_correct
result = {
'accuracy': final_accuracy,
'cascade_stats': {
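Besides a rule_evaluator, CascadeEvaluator accepts a plain callable through sample_score_fn (see the constructor above). A hedged sketch, where the callable and llm_judge_cfg stand in for your own rule and for a GenericLLMEvaluator config like the ones earlier in this diff:

def substring_match(prediction, reference, test_item=None):
    # Toy rule: mark the sample correct only if the gold answer
    # appears verbatim in the model output.
    correct = reference.strip() in prediction
    return {'correct': correct, 'pred': prediction, 'answer': reference}


cascade_evaluator = dict(
    type=CascadeEvaluator,
    sample_score_fn=substring_match,
    llm_evaluator=llm_judge_cfg,  # assumed: a GenericLLMEvaluator config dict
    # With parallel=False (as in the configs above), only samples the rule
    # step marks wrong are forwarded to the LLM judge.
    parallel=False,
)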

View File

@ -1,5 +1,6 @@
import os
import os.path as osp
from copy import deepcopy
from typing import Dict, List, Optional
import mmengine
@ -14,6 +15,8 @@ from opencompass.registry import (DICT_POSTPROCESSORS, ICL_PROMPT_TEMPLATES,
from opencompass.utils import build_dataset_from_cfg, build_model_from_cfg
from opencompass.utils.logging import get_logger
logger = get_logger(__name__)
class GenericLLMEvaluator(BaseEvaluator):
"""Generic LLM evaluator.
@ -23,6 +26,7 @@ class GenericLLMEvaluator(BaseEvaluator):
judge_cfg (ConfigDict): The config for Judge LLM.
dataset_cfg (ConfigDict): The config for dataset.
pred_postprocessor (ConfigDict): The config for postprocessor.
used for the prediction results.
dict_postprocessor (ConfigDict): The config for postprocessor,
used for evaluation results dict.
"""
@ -36,8 +40,7 @@ class GenericLLMEvaluator(BaseEvaluator):
dict_postprocessor: Optional[ConfigDict] = None,
keep_predictions: bool = False,
) -> None:
self.logger = get_logger()
super().__init__(pred_postprocessor=pred_postprocessor)
# If judge_cfg is not provided, fall back to the default configuration
if not judge_cfg:
self.judge_cfg = self.default_judge_cfg
@ -54,14 +57,14 @@ class GenericLLMEvaluator(BaseEvaluator):
self.dict_postprocessor = dict_postprocessor
self.pred_postprocessor = pred_postprocessor
def build_inferencer(self, ):
def build_inferencer(self):
"""Build LLM Inference."""
output_path = self._out_dir
self.output_path = f'{output_path}.json'
out_dir, out_name = osp.split(output_path)
out_name = f'{out_name}.json'
self.logger.info(
self.output_path = f'{self._out_dir}_replica{self.dataset_replica_idx}.json' # noqa
logger.info(f'LLM judge details will be saved at: {self.output_path}')
out_dir, out_name = osp.split(self.output_path)
logger.info(
f'Set self.output_path to {self.output_path} for current task')
assert self.output_path is not None, 'output_path is None'
@ -98,7 +101,6 @@ class GenericLLMEvaluator(BaseEvaluator):
# -------------- Build Inferencer ----------------
self.build_inferencer()
# ---------------- Process Predictions ------------------
predictions = self.pred_postprocess(predictions)
@ -178,7 +180,7 @@ class GenericLLMEvaluator(BaseEvaluator):
if self.dict_postprocessor is None:
return output
else:
kwargs = self.dict_postprocessor
kwargs = deepcopy(self.dict_postprocessor)
proc = DICT_POSTPROCESSORS.get(kwargs.pop('type'))
sig = inspect.signature(proc)
if 'dataset' in sig.parameters:
@ -192,7 +194,8 @@ class GenericLLMEvaluator(BaseEvaluator):
@property
def default_judge_cfg(self):
from opencompass.models import OpenAISDK
logger.info('Please set your judge model via the `OC_JUDGE_MODEL`, \
`OC_JUDGE_API_KEY`, and `OC_JUDGE_API_BASE` environment variables.')
DEFAULT_JUDGE_CFG = dict(
type=OpenAISDK,
path=os.environ['OC_JUDGE_MODEL'],
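When judge_cfg is empty, default_judge_cfg above builds an OpenAISDK judge from environment variables; a short, hedged setup sketch (model name, key, and URL are placeholders):

import os

os.environ['OC_JUDGE_MODEL'] = 'Qwen2.5-72B-Instruct'         # placeholder judge model
os.environ['OC_JUDGE_API_KEY'] = 'sk-placeholder'             # placeholder API key
os.environ['OC_JUDGE_API_BASE'] = 'http://localhost:8000/v1'  # placeholder endpoint

# Alternatively, pass an explicit judge_cfg (an OpenAISDK-style model dict)
# to GenericLLMEvaluator and skip the environment fallback entirely.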

View File

@ -3,9 +3,9 @@ from opencompass.registry import ICL_EVALUATORS
@ICL_EVALUATORS.register_module()
class MATHEvaluator(BaseEvaluator):
class MATHVerifyEvaluator(BaseEvaluator):
def score(self, predictions, references):
def score(self, predictions, references, test_set=None):
try:
from latex2sympy2_extended import NormalizationConfig
from math_verify import (ExprExtractionConfig,
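After the rename, the evaluator is exposed from opencompass.evaluator (see the package __init__ earlier in this diff); a minimal, hedged eval_cfg sketch using it on its own, outside a cascade:

from opencompass.evaluator import MATHVerifyEvaluator

math_eval_cfg = dict(
    # Used directly here; in the OmniMath config above it also serves as the
    # rule_evaluator stage of a CascadeEvaluator.
    evaluator=dict(type=MATHVerifyEvaluator),
)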

View File

@ -556,28 +556,27 @@ class OpenAI(BaseAPIModel):
class OpenAISDK(OpenAI):
def __init__(
self,
path: str = 'gpt-3.5-turbo',
max_seq_len: int = 16384,
query_per_second: int = 1,
rpm_verbose: bool = False,
retry: int = 2,
key: str | List[str] = 'ENV',
org: str | List[str] | None = None,
meta_template: Dict | None = None,
openai_api_base: str | List[str] = OPENAISDK_API_BASE,
openai_proxy_url: Optional[str] = None,
mode: str = 'none',
logprobs: bool | None = False,
top_logprobs: int | None = None,
temperature: float | None = None,
tokenizer_path: str | None = None,
extra_body: Dict | None = None,
verbose: bool = False,
status_code_mappings: dict = {},
think_tag: str = '</think>',
):
def __init__(self,
path: str = 'gpt-3.5-turbo',
max_seq_len: int = 16384,
query_per_second: int = 1,
rpm_verbose: bool = False,
retry: int = 2,
key: str | List[str] = 'ENV',
org: str | List[str] | None = None,
meta_template: Dict | None = None,
openai_api_base: str | List[str] = OPENAISDK_API_BASE,
openai_proxy_url: Optional[str] = None,
mode: str = 'none',
logprobs: bool | None = False,
top_logprobs: int | None = None,
temperature: float | None = None,
tokenizer_path: str | None = None,
extra_body: Dict | None = None,
verbose: bool = False,
http_client_cfg: dict = {},
status_code_mappings: dict = {},
think_tag: str = '</think>'):
super().__init__(
path,
max_seq_len,
@ -605,20 +604,20 @@ class OpenAISDK(OpenAI):
else:
self.openai_api_base = openai_api_base
if self.proxy_url is None:
self.openai_client = OpenAI(base_url=self.openai_api_base,
api_key=key)
else:
proxies = {
'http://': self.proxy_url,
'https://': self.proxy_url,
}
if self.proxy_url or http_client_cfg:
if self.proxy_url:
http_client_cfg['proxies'] = {
'http://': self.proxy_url,
'https://': self.proxy_url,
}
self.openai_client = OpenAI(
base_url=self.openai_api_base,
api_key=key,
http_client=httpx.Client(
**http_client_cfg) if http_client_cfg else None,
)
self.openai_client = OpenAI(
base_url=self.openai_api_base,
api_key=key,
http_client=httpx.Client(proxies=proxies),
)
if self.verbose:
self.logger.info(f'Used openai_client: {self.openai_client}')
self.status_code_mappings = status_code_mappings
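The new http_client_cfg is forwarded verbatim as httpx.Client(**http_client_cfg); a hedged model-config sketch (model name and endpoint are placeholders, and which keys httpx accepts depends on the installed httpx version):

from opencompass.models import OpenAISDK

judge_model_cfg = dict(
    type=OpenAISDK,
    path='gpt-4o-mini',                            # placeholder model name
    key='ENV',                                     # read the API key from the environment
    openai_api_base='https://api.example.com/v1',  # placeholder endpoint
    # Common httpx.Client options; 'proxies' is also filled in automatically
    # when openai_proxy_url is given, as shown above.
    http_client_cfg=dict(verify=False),
)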
@ -679,6 +678,7 @@ class OpenAISDK(OpenAI):
try:
if self.verbose:
self.logger.info('Start calling OpenAI API')
responses = self.openai_client.chat.completions.create(
**query_data, timeout=timeout) # timeout in seconds
if self.verbose:
@ -689,7 +689,6 @@ class OpenAISDK(OpenAI):
self.logger.info(responses)
except Exception:
pass # noqa F841
# Check if response is empty or content is empty
if (not responses.choices or not responses.choices[0].message
or

View File

@ -14,4 +14,3 @@ from .icl_misc_evaluator import AveragePPLEvaluator # noqa
from .icl_plugin_evaluator import TEvalEvaluator # noqa
from .icl_toxic_evaluator import ToxicEvaluator # noqa
from .lm_evaluator import LMEvaluator # noqa
from .math_evaluator import MATHEvaluator # noqa

View File

@ -8,6 +8,11 @@ import numpy as np
from datasets import Dataset
from scipy.stats import hypergeom
from opencompass.registry import TEXT_POSTPROCESSORS
from opencompass.utils.logging import get_logger
logger = get_logger(__name__)
def compute_pass_at_k(n, c, k):
if n - c < k:
@ -39,14 +44,19 @@ def compute_mg_pass_at_k(n, c, k):
class BaseEvaluator:
def __init__(self) -> None:
pass
def __init__(self, pred_postprocessor=None) -> None:
self.pred_postprocessor = pred_postprocessor
self._dataset_replica_idx = 0 # Default value for dataset_replica_idx
@property
def output_dir(self):
# please see opencompass/opencompass/tasks/openicl_eval.py Line 197-200
return self._out_dir
@property
def dataset_replica_idx(self):
return self._dataset_replica_idx
def group(self, n: int, details: List[Dict[str, Any]],
test_set: Dataset) -> Dict[str, Any]:
example2replications = {}
@ -82,6 +92,15 @@ class BaseEvaluator:
[detail[metric] for detail in details])
return g_passk_details
def pred_postprocess(self, predictions: List) -> Dict:
if not hasattr(
self, 'pred_postprocessor') or self.pred_postprocessor is None:
return predictions
else:
kwargs = deepcopy(self.pred_postprocessor)
proc = TEXT_POSTPROCESSORS.get(kwargs.pop('type'))
return [proc(pred, **kwargs) for pred in predictions]
def evaluate(
self,
k: Union[int, List[int]],
@ -98,10 +117,14 @@ class BaseEvaluator:
raise ValueError(
'Predictions and references must have the same length')
real_size = len(original_dataset) // n
real_size = len(original_dataset) // n # dataset size of each replica
all_details = []
all_results = []
# Run evaluation for each replica
for i in range(n):
self._dataset_replica_idx = i
logger.info(f'Running {i}-th replica of evaluation')
def select_fn(i, real_size, x):
if isinstance(x, Dataset):
@ -111,11 +134,14 @@ class BaseEvaluator:
else:
return x
results = self.score(
**{
key: select_fn(i, real_size, value)
for key, value in score_kwargs.items()
})
current_params = {
key: select_fn(i, real_size, value)
for key, value in score_kwargs.items()
}
current_params['predictions'] = self.pred_postprocess(
current_params['predictions'])
results = self.score(**current_params)
details = results.pop('details', None)
if details is not None:
if isinstance(details, Dict):
@ -124,11 +150,11 @@ class BaseEvaluator:
all_results.append(results)
eval_results = {}
for single_results in all_results:
for key in single_results:
for single_replica_results in all_results:
for key in single_replica_results:
if key not in eval_results:
eval_results[key] = []
eval_results[key].append(single_results[key])
eval_results[key].append(single_replica_results[key])
for key in deepcopy(eval_results):
if isinstance(eval_results[key][0], float) or isinstance(
eval_results[key][0], int):
@ -138,9 +164,8 @@ class BaseEvaluator:
eval_results.pop(key)
else:
eval_results[key] = np.mean(eval_results[key])
else:
eval_results[key] = eval_results[key][0]
# Calculate the additional metrics
grouped_examples = self.group(n, all_details, original_dataset)
can_calculate = False
if len(all_details) != 0:
@ -158,6 +183,10 @@ class BaseEvaluator:
elif example['detail'].get('is_correct', None) is not None:
can_calculate = True
c += int(example['detail']['is_correct'])
elif example['detail'].get('cascade_correct',
None) is not None:
can_calculate = True
c += int(example['detail']['cascade_correct'])
k_list = [k] if isinstance(k, int) else k
if can_calculate and n > 1 and max(k_list) > 1:
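For context, the pass@k bookkeeping above keys off per-sample flags (is_correct, correct, or the new cascade_correct) and uses scipy's hypergeometric distribution internally; the standard estimator it corresponds to is sketched below for illustration:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k draws from n replicas is correct,
    given c of the n replicas were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 replicas, 2 correct -> pass@1 = 0.5, pass@2 = 5/6
print(pass_at_k(4, 2, 1), pass_at_k(4, 2, 2))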

View File

@ -1,10 +1,11 @@
import os
import random
from typing import List
from typing import List, Optional
import evaluate
import numpy as np
from datasets import Dataset
from mmengine.config import ConfigDict
from opencompass.registry import ICL_EVALUATORS
@ -19,12 +20,17 @@ class HuggingfaceEvaluator(BaseEvaluator):
seed (int): There exists some randomness during the calculation of some
metrics, thus we set a fixed random seed for reproducing. Defaults
to 0.
pred_postprocessor (optional): Function or configuration for prediction
post-processing.
"""
def __init__(self, metric: str, seed: int = 0) -> None:
def __init__(self,
metric: str,
seed: int = 0,
pred_postprocessor=None) -> None:
self.metric = metric
self.seed = seed
super().__init__()
super().__init__(pred_postprocessor=pred_postprocessor)
def _preprocess(self, predictions: List, references: List) -> dict:
"""Preprocess the final predictions and references to needed format.
@ -52,7 +58,10 @@ class HuggingfaceEvaluator(BaseEvaluator):
"""
return scores
def score(self, predictions: List, references: List) -> dict:
def score(self,
predictions: List,
references: List,
test_set=None) -> dict:
"""Calculate scores.
Args:
@ -92,10 +101,15 @@ class HuggingfaceEvaluator(BaseEvaluator):
class AccEvaluator(HuggingfaceEvaluator):
"""Accuracy evaluator."""
def __init__(self) -> None:
super().__init__(metric='accuracy')
def __init__(self,
pred_postprocessor: Optional[ConfigDict] = None) -> None:
super().__init__(metric='accuracy',
pred_postprocessor=pred_postprocessor)
def _preprocess(self, predictions: List, references: List) -> dict:
def _preprocess(self,
predictions: List,
references: List,
test_set=None) -> dict:
"""Preprocess the final predictions and references to needed format.
Args:
@ -187,8 +201,9 @@ class RougeEvaluator(HuggingfaceEvaluator):
Note: this evaluator is not suitable for chinese datasets.
"""
def __init__(self) -> None:
super().__init__(metric='rouge')
def __init__(self,
pred_postprocessor: Optional[ConfigDict] = None) -> None:
super().__init__(metric='rouge', pred_postprocessor=pred_postprocessor)
def _postprocess(self, scores: dict) -> dict:
"""Postprocess for final scores.
@ -206,8 +221,10 @@ class RougeEvaluator(HuggingfaceEvaluator):
class BleuEvaluator(HuggingfaceEvaluator):
"""Bleu evaluator."""
def __init__(self) -> None:
super().__init__(metric='sacrebleu')
def __init__(self,
pred_postprocessor: Optional[ConfigDict] = None) -> None:
super().__init__(metric='sacrebleu',
pred_postprocessor=pred_postprocessor)
class BleuFloresEvaluator(HuggingfaceEvaluator):

View File

@ -26,6 +26,7 @@ class NumWorkerPartitioner(BasePartitioner):
dataset_size_path (str): The path to the dataset size cache file.
keep_keys (list[str]): The keys to be kept from the experiment config
to the task config.
force_rebuild (bool): Whether to force rebuilding the dataset to get its size.
"""
def __init__(self,
@ -35,7 +36,8 @@ class NumWorkerPartitioner(BasePartitioner):
min_task_size: int = 16,
strategy: str = 'heuristic',
dataset_size_path: str = '.cache/dataset_size.json',
keep_keys: Optional[List[str]] = None):
keep_keys: Optional[List[str]] = None,
force_rebuild: bool = False):
super().__init__(out_dir=out_dir, keep_keys=keep_keys)
if strategy == 'split' and num_worker is not None:
self.logger.warning('num_worker is ignored with split.')
@ -44,6 +46,7 @@ class NumWorkerPartitioner(BasePartitioner):
self.num_split = num_split or num_worker
self.min_task_size = min_task_size
self.dataset_size_path = dataset_size_path
self.force_rebuild = force_rebuild
assert strategy in ('heuristic', 'split'), \
f'Unsupported partition strategy: {strategy}. '\
'Supported strategies are: `heuristic`, `split` .'
@ -106,7 +109,7 @@ class NumWorkerPartitioner(BasePartitioner):
@property
def dataset_size(self):
if not hasattr(self, '_dataset_size'):
if osp.exists(self.dataset_size_path):
if not self.force_rebuild and osp.exists(self.dataset_size_path):
self._dataset_size = mmengine.load(self.dataset_size_path)
else:
self._dataset_size = {}
@ -130,22 +133,25 @@ class NumWorkerPartitioner(BasePartitioner):
def get_size(self, dataset: ConfigDict) -> int:
dataset_abbr = dataset_abbr_from_cfg(dataset)
test_range = dataset.reader_cfg.get('test_range', '')
if dataset_abbr in self.dataset_size:
# If not forcing rebuild and data exists in cache, use the cache
if not self.force_rebuild and dataset_abbr in self.dataset_size:
actual_size = eval('len(range(self.dataset_size[dataset_abbr])'
f'{test_range})')
return actual_size
# Otherwise, rebuild the dataset to get its size
dataset = build_dataset_from_cfg(dataset)
self.dataset_size[dataset_abbr] = len(dataset.test)
mmengine.mkdir_or_exist('.cache/')
mmengine.dump(self.dataset_size,
self.dataset_size_path,
indent=4,
ensure_ascii=False)
# Save to cache file
if self.dataset_size_path:
mmengine.mkdir_or_exist('.cache/')
mmengine.dump(self.dataset_size,
self.dataset_size_path,
indent=4,
ensure_ascii=False)
actual_size = eval('len(range(self.dataset_size[dataset_abbr])'
f'{test_range})')
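A hedged sketch of an infer section that exercises the new force_rebuild flag; the runner and task entries are the usual OpenCompass defaults and act as placeholders here:

from opencompass.partitioners import NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

infer = dict(
    partitioner=dict(
        type=NumWorkerPartitioner,
        num_worker=8,
        # Skip .cache/dataset_size.json and re-measure every dataset.
        force_rebuild=True,
    ),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)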

View File

@ -146,11 +146,16 @@ class OpenICLEvalTask(BaseTask):
preds = []
i = 1
while osp.exists(osp.realpath(filename)):
sub_preds = mmengine.load(filename)
preds.extend(
[sub_preds[str(i)] for i in range(len(sub_preds))])
filename = root + f'_{i}' + ext
i += 1
try:
sub_preds = mmengine.load(filename)
preds.extend(
[sub_preds[str(i)] for i in range(len(sub_preds))])
filename = root + f'_{i}' + ext
i += 1
except Exception as e:
self.logger.error(
f'Error loading prediction file {filename}: {e}')
break
pred_dicts = copy.deepcopy(preds)
preds = {k: [pred.get(k) for pred in preds] for k in preds[0]}

View File

@ -2,6 +2,8 @@ import logging
import os
from mmengine.logging import MMLogger
from rich.console import Console
from rich.syntax import Syntax
_nameToLevel = {
'CRITICAL': logging.CRITICAL,
@ -79,3 +81,14 @@ class FilterDuplicateMessage(logging.Filter):
self.seen.add(record.msg)
return True
return False
def pretty_print_config(cfg):
"""Pretty print config using the rich library."""
console = Console()
config_str = cfg.pretty_text
syntax = Syntax(config_str,
'python',
theme='solarized-dark',
line_numbers=True)
console.print(syntax)
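A short usage sketch for the new helper, assuming it is exported from opencompass.utils.logging and given an mmengine Config (the config path is a placeholder):

from mmengine.config import Config
from opencompass.utils.logging import pretty_print_config  # assumed export location

cfg = Config.fromfile('examples/eval_demo.py')  # placeholder config file
pretty_print_config(cfg)  # renders cfg.pretty_text with rich syntax highlighting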

View File

@ -150,6 +150,13 @@ def get_config_from_arg(args) -> Config:
dataset['meta_path'] = args.custom_dataset_meta_path
dataset = make_custom_dataset_config(dataset)
datasets.append(dataset)
# Apply the requested number of dataset repeat runs (replicas)
if len(datasets) > 0 and args.dataset_num_runs > 1:
logger.warning(f'--dataset-num-runs is set; datasets will be evaluated with {args.dataset_num_runs} runs.')
for _dataset in datasets:
logger.warning(f"The default num runs of {_dataset['abbr']} is: {_dataset['n']}, changed into: {args.dataset_num_runs}")
_dataset['n'] = args.dataset_num_runs
_dataset['k'] = args.dataset_num_runs
# parse model args
if not args.models and not args.hf_path:
@ -204,7 +211,6 @@ def get_config_from_arg(args) -> Config:
summarizers_dir = [
os.path.join(args.config_dir, 'summarizers'),
os.path.join(default_configs_dir, './summarizers'),
]
# Check if summarizer_arg contains '/'
@ -308,7 +314,7 @@ def change_accelerator(models, accelerator):
model_kwargs=model_kwargs,
max_seq_len=model.get('max_seq_len', None),
max_out_len=model['max_out_len'],
batch_size=16,
batch_size=model.get('batch_size', 16),
run_cfg=model['run_cfg'],
stop_words=model.get('stop_words', []),
)
@ -335,7 +341,7 @@ def change_accelerator(models, accelerator):
gen_config=gen_config,
max_seq_len=model.get('max_seq_len', None),
max_out_len=model['max_out_len'],
batch_size=16,
batch_size=model.get('batch_size', 16),
run_cfg=model['run_cfg'],
stop_words=model.get('stop_words', []),
)
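The --dataset-num-runs handling added earlier in this file is equivalent to setting n and k on each dataset entry by hand; a hedged sketch of both routes (the CLI line is shown as a comment, and the dataset/model names are placeholders):

# CLI route:
#   opencompass --datasets <dataset_config> --models <model_config> --dataset-num-runs 4

# Config route: set the replica count directly on each dataset dict.
for _dataset in datasets:      # `datasets` as assembled in get_config_from_arg
    _dataset['n'] = 4          # evaluate 4 replicas of every dataset
    _dataset['k'] = 4          # report pass@k for k up to 4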