[Update] Add CascadeEvaluator with Data Replica (#2022)

* Update CascadeEvaluator

* Update Config
Songyang Zhang 2025-05-20 16:46:55 +08:00 committed by GitHub
parent 7a7a4517ab
commit aa2b89b6f8
43 changed files with 1471 additions and 269 deletions

View File

@ -60,7 +60,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
- **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, a flexible evaluation mechanism that allows multiple evaluators to work in sequence. This enables creating customized evaluation pipelines for complex assessment scenarios. Check out the [documentation](docs/en/advanced_guides/llm_judge.md) for more details! 🔥🔥🔥
- **\[2025.03.11\]** We have supported evaluation for `SuperGPQA` which is a great benchmark for measuring LLM knowledge ability 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for `DeepSeek-R1` series model, please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHVerifyEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model which has enhanced performance on reasoning and knowledge-intensive tasks.
- **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](examples/eval_academic_leaderboard_202412.py), which allows users to easily reproduce the official evaluation results by configuring it.
- **\[2024.11.14\]** OpenCompass now offers support for a sophisticated benchmark designed to evaluate complex reasoning skills — [MuSR](https://arxiv.org/pdf/2310.16049). Check out the [demo](examples/eval_musr.py) and give it a spin! 🔥🔥🔥
@ -246,7 +246,7 @@ Currently, OpenCompass has provided standard recommended configurations for dat
opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat
# Recommended Evaluation Config based on LLM Judge
opencompass --datasets aime2024_llm_judge_gen --models hf_internlm2_5_1_8b_chat
opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
```
If you want to use multiple GPUs to evaluate the model in data parallel, you can use `--max-num-worker`.

View File

@ -60,7 +60,7 @@
- **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, which lets multiple evaluators work in sequence so that customized evaluation pipelines can be built for more complex assessment scenarios. See the [documentation](docs/zh_cn/advanced_guides/llm_judge.md) for usage details! 🔥🔥🔥
- **\[2025.03.11\]** `SuperGPQA` is now supported, a knowledge benchmark covering 285 graduate-level disciplines. Give it a try! 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series models; see [Evaluating Reasoning Models](docs/zh_cn/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluation and `MATHEvaluator` for mathematical reasoning assessment. See the [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) docs for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluation and `MATHVerifyEvaluator` for mathematical reasoning assessment. See the [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) docs for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model, which achieves the best performance at its scale on reasoning and knowledge-intensive tasks. Give it a try!
- **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](configs/eval_academic_leaderboard_202412.py) leaderboard, so you can reproduce the official results with a simple configuration.
- **\[2024.10.14\]** The OpenAI multilingual QA dataset [MMMLU](https://huggingface.co/datasets/openai/MMMLU) is now supported. Give it a try! 🔥🔥🔥
@ -237,7 +237,7 @@ humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ce
opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat
# Recommended evaluation config based on LLM Judge
opencompass --datasets aime2024_llm_judge_gen --models hf_internlm2_5_1_8b_chat
opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
```
In addition, if you want to run model inference on multiple GPUs in data parallel, you can use the `--max-num-worker` argument.

View File

@ -303,7 +303,7 @@
category: Examination
paper: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
configpath: opencompass/configs/datasets/aime2024/aime2024_gen.py
configpath_llmjudge: opencompass/configs/datasets/aime2024/aime2024_llm_judge_gen.py
configpath_llmjudge: opencompass/configs/datasets/aime2024/aime2024_llmjudge_gen.py
- anli:
name: Adversarial NLI
category: Reasoning

View File

@ -278,7 +278,7 @@ Here's an example of how to configure the CascadeEvaluator:
```python
# Define a rule-based evaluator
rule_evaluator = dict(type=MATHEvaluator)
rule_evaluator = dict(type=MATHVerifyEvaluator)
# Define an LLM judge evaluator
llm_judge_evaluator = dict(
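# The snippet above only shows the two evaluator definitions. A minimal sketch
# of how they are typically combined into a CascadeEvaluator (field names
# follow the configs added in this PR; the cascade order described in the
# comments is an assumption based on the evaluator's name, and the judge model
# settings are omitted):
from opencompass.evaluator import CascadeEvaluator

cascade_evaluator = dict(
    type=CascadeEvaluator,
    # Rule-based verification is cheap, so it is listed first; the LLM judge
    # is assumed to handle the samples the rule-based check does not accept.
    rule_evaluator=rule_evaluator,
    llm_evaluator=llm_judge_evaluator,
    parallel=False,
)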

View File

@ -2,7 +2,7 @@
## Introduction
Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHEvaluator components.
Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHVerifyEvaluator components.
## Dataset Format
@ -61,7 +61,7 @@ math_infer_cfg = dict(
```python
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator),
evaluator=dict(type=MATHVerifyEvaluator),
)
```
@ -86,11 +86,11 @@ math_datasets = [
]
```
## MATHEvaluator
## MATHVerifyEvaluator
The MATHEvaluator is specifically designed to evaluate mathematical answers. It is developed based on the math_verify library, which provides mathematical expression parsing and verification capabilities, supporting extraction and equivalence verification for both LaTeX and general expressions.
The MATHVerifyEvaluator is specifically designed to evaluate mathematical answers. It is developed based on the math_verify library, which provides mathematical expression parsing and verification capabilities, supporting extraction and equivalence verification for both LaTeX and general expressions.
The MATHEvaluator implements:
The MATHVerifyEvaluator implements:
1. Extracts answers from both predictions and references using LaTeX extraction
2. Handles various LaTeX formats and environments
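A minimal sketch of the math_verify primitives the evaluator builds on (the expressions are invented for illustration and assume the `math_verify` package is installed):

```python
# Rough illustration of the parse/verify workflow; not the evaluator's
# exact internal code.
from math_verify import parse, verify

gold = parse('$\\frac{1}{2}$')         # reference answer in LaTeX
pred = parse('The answer is $0.5$.')   # model output containing the answer
print(verify(gold, pred))              # expected to print True (1/2 == 0.5)
```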
@ -133,7 +133,7 @@ Here's a complete example of how to set up math evaluation:
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
from opencompass.openicl.icl_evaluator.math_evaluator import MATHVerifyEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
@ -160,7 +160,7 @@ math_infer_cfg = dict(
# Evaluation configuration
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator),
evaluator=dict(type=MATHVerifyEvaluator),
)
# Dataset configuration

View File

@ -277,7 +277,7 @@ OpenCompass also provides the cascade evaluator `CascadeEvaluator`, which combines rule-based
```python
# Define a rule-based evaluator
rule_evaluator = dict(type=MATHEvaluator)
rule_evaluator = dict(type=MATHVerifyEvaluator)
# Define an LLM judge evaluator
llm_judge_evaluator = dict(

View File

@ -2,7 +2,7 @@
## Introduction
Mathematical reasoning is a key capability of large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its ability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHEvaluator components.
Mathematical reasoning is a key capability of large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its ability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHVerifyEvaluator components.
## Dataset Format
@ -61,7 +61,7 @@ math_infer_cfg = dict(
```python
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator),
evaluator=dict(type=MATHVerifyEvaluator),
)
```
@ -86,11 +86,11 @@ math_datasets = [
]
```
## MATHEvaluator
## MATHVerifyEvaluator
MATHEvaluator is an evaluator designed specifically for grading mathematical answers. It is built on the math_verify library, which provides mathematical expression parsing and verification, supporting extraction and equivalence checking for both LaTeX and general expressions.
MATHVerifyEvaluator is an evaluator designed specifically for grading mathematical answers. It is built on the math_verify library, which provides mathematical expression parsing and verification, supporting extraction and equivalence checking for both LaTeX and general expressions.
MATHEvaluator provides the following features:
MATHVerifyEvaluator provides the following features:
1. Extracts answers from both predictions and references using a LaTeX extractor
2. Handles various LaTeX formats and environments
@ -133,7 +133,7 @@ MATHEvaluator provides the following features:
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
from opencompass.evaluator import MATHVerifyEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
@ -160,7 +160,7 @@ math_infer_cfg = dict(
# Evaluation configuration
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator),
evaluator=dict(type=MATHVerifyEvaluator),
)
# Dataset configuration

View File

@ -7,9 +7,12 @@ from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator, CascadeEvaluator
from opencompass.evaluator import (
GenericLLMEvaluator,
CascadeEvaluator,
MATHVerifyEvaluator,
)
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.openicl.icl_evaluator import MATHEvaluator
from opencompass.datasets import (
MATHDataset,
math_postprocess_v2,
@ -94,7 +97,7 @@ llm_judge_evaluator = dict(
judge_cfg=dict(),
)
rule_evaluator =dict(type=MATHEvaluator)
rule_evaluator =dict(type=MATHVerifyEvaluator)
cascade_evaluator = dict(type=CascadeEvaluator,
llm_evaluator=llm_judge_evaluator,
rule_evaluator=rule_evaluator,

examples/eval_qwen3.py (new file, 142 lines added)
View File

@ -0,0 +1,142 @@
import os.path as osp
from opencompass.models import OpenAISDK
from mmengine.config import read_base
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
from opencompass.runners import LocalRunner
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
with read_base():
from opencompass.configs.datasets.aime2024.aime2024_cascade_eval_gen_5e9f4f import aime2024_datasets
from opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f import aime2025_datasets
from opencompass.configs.datasets.math.math_500_cascade_eval_gen_6ff468 import math_datasets
#######################################################################
# PART 0 Meta Info #
#######################################################################
api_meta_template = dict(round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
)
judge_cfg = dict(
abbr='qwen2-5-32B-Instruct',
type=OpenAISDK,
path='Qwen/Qwen2.5-32B-Instruct',
key='sk-1234',
openai_api_base=[
'http://x.x.x.x:4000/v1',
],
meta_template=api_meta_template,
query_per_second=8,
batch_size=256,
temperature=0.001,
# max_completion_tokens=32768,
tokenizer_path='gpt-4o-2024-05-13',
# verbose=True,
max_out_len=16384,
max_seq_len=32768,
# max_seq_len=49152,
mode='mid',
retry=10
)
#######################################################################
# PART 1 Datasets List #
#######################################################################
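# Each entry below pairs a dataset list with its replica count: the loop
# writes the count into each dataset's `n` field, and the summarizer reports
# the averaged result (e.g. 'accuracy (32 runs average)').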
repeated_info = [
(math_datasets, 4),
(aime2024_datasets, 32),
(aime2025_datasets, 32),
]
for datasets_, num in repeated_info:
for dataset_ in datasets_:
dataset_['n'] = num
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')),
[],
)
for item in datasets:
item['infer_cfg']['inferencer']['max_out_len'] = 32768
try:
# Inject the shared judge model config, whether the evaluator is a plain
# LLM judge or a CascadeEvaluator wrapping one in `llm_evaluator`.
if 'judge_cfg' in item['eval_cfg']['evaluator']:
item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
elif 'judge_cfg' in item['eval_cfg']['evaluator']['llm_evaluator']:
item['eval_cfg']['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
except KeyError:
pass
#######################################################################
# PART 2 Dataset Summarizer #
#######################################################################
summarizer = dict(
dataset_abbrs=[
'MATH',
['math_prm800k_500', 'accuracy (4 runs average)'],
['aime2024', 'accuracy (32 runs average)'],
['aime2025', 'accuracy (32 runs average)'],
['livemathbench_hard', 'naive_average'],
['OlympiadBenchMath', 'accuracy'],
['olymmath', 'naive_average'],
],
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []
),
)
#######################################################################
# PART 3 Models List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
models += [
dict(
abbr='Qwen_Qwen3-235B-A22B',
type=OpenAISDK,
path='Qwen/Qwen3-235B-A22B',
key='sk-admin',
openai_api_base=[
'http://106.15.231.215:40007/v1/',
],
meta_template=dict(
# begin=dict(role='SYSTEM', api_role='SYSTEM', prompt=''),
round=[
dict(role='HUMAN', api_role='HUMAN'),
# XXX: all system roles are mapped to human in purpose
dict(role='BOT', api_role='BOT', generate=True),
]
),
query_per_second=16,
batch_size=128,
# batch_size=1,
temperature=0.6,
# max_completion_tokens=32768,
tokenizer_path='gpt-4',
# verbose=True,
max_out_len=32768,
max_seq_len=32768,
pred_postprocessor=dict(type=extract_non_reasoning_content)
),
]
infer = dict(
partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
eval = dict(
partitioner=dict(type=NaivePartitioner, n=8),
runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)
base_exp_dir = 'outputs/qwen3_reasoning'
work_dir = osp.join(base_exp_dir, 'chat_objective')

View File

@ -12,8 +12,8 @@ from mmengine.config import Config, DictAction
from opencompass.registry import PARTITIONERS, RUNNERS, build_from_cfg
from opencompass.runners import SlurmRunner
from opencompass.summarizers import DefaultSummarizer
from opencompass.utils import (LarkReporter, get_logger, read_from_station,
save_to_station)
from opencompass.utils import (LarkReporter, get_logger, pretty_print_config,
read_from_station, save_to_station)
from opencompass.utils.run import (fill_eval_cfg, fill_infer_cfg,
get_config_from_arg)
@ -94,6 +94,11 @@ def parse_args():
help='Use the custom config directory instead of config/ to '
'search the configs for datasets, models and summarizers',
type=str)
parser.add_argument(
'--config-verbose',
default=False,
action='store_true',
help='Whether to print the config in verbose mode.')
parser.add_argument('-l',
'--lark',
help='Report the running status to lark bot',
@ -131,7 +136,7 @@ def parse_args():
'correctness of each sample, bpb, etc.',
action='store_true',
)
# for the results persistence
parser.add_argument('-sp',
'--station-path',
help='Path to your results station.',
@ -150,7 +155,12 @@ def parse_args():
'data station.',
action='store_true',
)
# for evaluation with multiple runs
parser.add_argument('--dataset-num-runs',
help='How many runs for one dataset',
type=int,
default=1,
)
# set srun args
slurm_parser = parser.add_argument_group('slurm_args')
@ -299,6 +309,11 @@ def main():
content = f'{getpass.getuser()}\'s task has been launched!'
LarkReporter(cfg['lark_bot_url']).post(content)
# print config if specified --config-verbose
if args.config_verbose:
pretty_print_config(cfg)
# infer
if args.mode in ['all', 'infer']:
# When user have specified --slurm or --dlc, or have not set

View File

@ -0,0 +1,109 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets import OlymMATHDataset
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator
)
# ----------------------------- Detailed Config -----------------------------
math_reader_cfg = dict(input_columns=['problem'], output_column='answer', train_split='test')
math_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{problem}\nRemember to put your final answer within \\boxed{}.'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
sub_sets = ['en-hard', 'zh-hard', 'en-easy', 'zh-easy']
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
# Evaluation configuration
olymmath_datasets = []
for sub_set in sub_sets:
math_eval_cfg = dict(
evaluator=dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
type=OlymMATHDataset,
path='RUC-AIBOX/OlymMATH',
reader_cfg=math_reader_cfg,
subset=sub_set,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False,
),
)
olymmath_datasets.append(
dict(
type=OlymMATHDataset,
abbr=f'olymmath_{sub_set}',
path='RUC-AIBOX/OlymMATH',
reader_cfg=math_reader_cfg,
infer_cfg=math_infer_cfg,
eval_cfg=math_eval_cfg,
subset=sub_set,
n=1
)
)

View File

@ -0,0 +1,114 @@
from mmengine.config import read_base
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import OlympiadBenchDataset, OlympiadBenchEvaluator, olympiadbench_postprocess_v2
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.evaluator import (
GenericLLMEvaluator,
CascadeEvaluator,
MATHVerifyEvaluator
)
from opencompass.datasets import generic_llmjudge_postprocess
with read_base():
from .OlympiadBench_categories import categories
# Create prompter instance for problems
olympiadbench_prompter_cfg = dict(
type='OlympiadBenchPrompter'
)
olympiadbench_reader_cfg = dict(
input_columns=[
'problem', 'language', 'subject', 'question_type',
'answer_type', 'is_multiple_answer', 'unit', 'questions'
],
output_column='solution'
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{solution}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
olympiadbench_datasets = []
for _name in categories:
olympiadbench_infer_cfg = dict(
prompt_template=dict(
type='OlympiadBenchTemplate'
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
olympiadbench_eval_cfg = dict(
evaluator=dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
type=OlympiadBenchDataset,
path='opencompass/OlympiadBench',
name=_name,
reader_cfg=olympiadbench_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False
)
)
olympiadbench_datasets.append(
dict(
type=OlympiadBenchDataset,
abbr=f'OlympiadBench_{_name}',
path='opencompass/OlympiadBench',
name=_name,
reader_cfg=olympiadbench_reader_cfg,
infer_cfg=olympiadbench_infer_cfg,
eval_cfg=olympiadbench_eval_cfg,
n=1,
)
)

View File

@ -1,28 +1,44 @@
"""
Summary: A config for AIME-2024 Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- MATHVerifyEvaluator
- GenericLLMEvaluator
Repeat: 1
Available Models:
- Instruct/Chat Models
"""
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import Aime2024Dataset, MATHEvaluator, math_postprocess_v2
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.utils import xml_tag_postprocessor
aime2024_reader_cfg = dict(
input_columns=['question'],
output_column='answer'
from opencompass.datasets import Aime2024Dataset
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator
)
aime2024_reader_cfg = dict(input_columns=['question'], output_column='answer')
aime2024_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{question}\nRemember to put your final answer within \\boxed{}.'),
dict(
role='HUMAN',
prompt='{question}\nRemember to put your final answer within \\boxed{}.',
),
],
)
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=2048)
inferencer=dict(type=GenInferencer),
)
@ -51,24 +67,27 @@ GRADER_TEMPLATE = """
Judging the correctness of candidates' answers:
""".strip()
aime2024_eval_cfg = dict(
evaluator=dict(
cascade_evaluator = dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator= dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=Aime2024Dataset,
@ -77,9 +96,13 @@ aime2024_eval_cfg = dict(
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
pred_postprocessor=dict(type=xml_tag_postprocessor, tag='<conclude>'),
),
pred_role='BOT',
parallel=False,
)
aime2024_eval_cfg = dict(
evaluator=cascade_evaluator,
)
aime2024_datasets = [
@ -90,6 +113,6 @@ aime2024_datasets = [
reader_cfg=aime2024_reader_cfg,
infer_cfg=aime2024_infer_cfg,
eval_cfg=aime2024_eval_cfg,
mode='singlescore',
n=1,# Evaluate the dataset with n times
)
]
]

View File

@ -0,0 +1,115 @@
"""
Summary: A config for AIME-2025 Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- MATHVerifyEvaluator
- GenericLLMEvaluator
Repeat: 1
Available Models:
- Instruct/Chat Models
"""
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CustomDataset
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator
)
aime2025_reader_cfg = dict(input_columns=['question'], output_column='answer')
aime2025_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{question}\nRemember to put your final answer within \\boxed{}.',
),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
cascade_evaluator = dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CustomDataset,
path='opencompass/aime2025',
reader_cfg=aime2025_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False,
)
aime2025_eval_cfg = dict(
evaluator=cascade_evaluator,
)
aime2025_datasets = [
dict(
type=CustomDataset,
abbr='aime2025',
path='opencompass/aime2025',
reader_cfg=aime2025_reader_cfg,
infer_cfg=aime2025_infer_cfg,
eval_cfg=aime2025_eval_cfg,
n=1,
)
]

View File

@ -0,0 +1,118 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import GPQADataset, GPQA_Simple_Eval_postprocess
from opencompass.evaluator import GenericLLMEvaluator, CascadeEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.utils.text_postprocessors import match_answer_pattern
# openai_simple_eval prompt
align_prompt = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of ABCD.
{question}
A) {A}
B) {B}
C) {C}
D) {D}
""".strip()
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {question}\n A) {A}\n B) {B}\n C) {C}\n D) {D}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
gpqa_reader_cfg = dict(
input_columns=['question', 'A', 'B', 'C', 'D'],
output_column='answer')
gpqa_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=align_prompt),
], )),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
gpqa_datasets = []
gpqa_subsets = {
# 'extended': 'gpqa_extended.csv',
# 'main': 'gpqa_main.csv',
'diamond': 'gpqa_diamond.csv'
}
for split in list(gpqa_subsets.keys()):
gpqa_eval_cfg = dict(
evaluator=dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=AccEvaluator,
pred_postprocessor=dict(type=match_answer_pattern, answer_pattern=r'(?i)ANSWER\s*:\s*([A-D])'),
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
type=GPQADataset,
path='./data/gpqa/',
name=gpqa_subsets[split],
reader_cfg=gpqa_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False,
),
)
gpqa_datasets.append(
dict(
abbr='GPQA_' + split,
type=GPQADataset,
path='./data/gpqa/',
name=gpqa_subsets[split],
reader_cfg=gpqa_reader_cfg,
infer_cfg=gpqa_infer_cfg,
eval_cfg=gpqa_eval_cfg,
mode='singlescore',
)
)

View File

@ -1,17 +1,28 @@
"""
Summary: A config for KoR-Bench Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- korbenchEvaluator
- GenericLLMEvaluator
Repeat: 1
Available Models:
- Instruct/Chat Models
"""
from opencompass.datasets.korbench.korbench import korbenchDataset, korbenchEvaluator
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.evaluator import GenericLLMEvaluator, CascadeEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.utils import xml_tag_postprocessor
categories = ['cipher', 'counterfactual', 'logic', 'operation', 'puzzle']
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
@ -30,7 +41,7 @@ GRADER_TEMPLATE = """
<Original Question Begin>: \n{prompt}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
@ -50,7 +61,7 @@ for category in categories:
round=[
dict(
role='HUMAN',
prompt='{prompt}' # f-string
prompt='{prompt}' # f-string
)
]
)
@ -66,41 +77,46 @@ for category in categories:
infer_cfg = dict(
prompt_template=prompt_template,
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=1024),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
type=CascadeEvaluator,
rule_evaluator=dict(
type=korbenchEvaluator,
),
dataset_cfg=dict(
type=korbenchDataset,
path='opencompass/korbench',
prompt_mode='0_shot',
category=category,
reader_cfg=reader_cfg,
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt=GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
type=korbenchDataset,
path='opencompass/korbench',
prompt_mode='0_shot',
category=category,
reader_cfg=reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
pred_postprocessor=dict(type=xml_tag_postprocessor, tag='<conclude>'),
),
pred_role='BOT',
parallel=False,
)
)
# Dataset
@ -113,7 +129,7 @@ for category in categories:
reader_cfg=reader_cfg,
infer_cfg=infer_cfg,
eval_cfg=eval_cfg,
mode='singlescore',
n=1,
)
korbench_0shot_single_datasets.append(korbench_dataset)
korbench_0shot_single_datasets.append(korbench_dataset)

View File

@ -0,0 +1,120 @@
"""
Summary: A config for LiveMathBench-Hard-202412 Dataset Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- MATHVerifyEvaluator
- GenericLLMEvaluator
Repeat: 32
Available Models:
- Instruct/Chat Models
"""
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import CustomDataset
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator,
)
livemathbench_reader_cfg = dict(input_columns=['question'], output_column='answer')
# Inference configuration
livemathbench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{question}\nRemember to put your final answer within \\boxed{}.',
),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Template for the LLM judge
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
splits = ['hard_cn', 'hard_en']
# Dataset configuration
livemathbench_datasets = [
dict(
type=CustomDataset,
abbr=f'livemathbench_hard_custom_{split}',
path='data/LiveMathBench',
local_mode=True,
file_name=f'202412/{split}.jsonl',
reader_cfg=livemathbench_reader_cfg,
infer_cfg=livemathbench_infer_cfg,
eval_cfg=dict(
# Evaluation configuration using LLM as judge
evaluator=dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CustomDataset,
path='data/LiveMathBench',
local_mode=True,
file_name=f'202412/{split}.jsonl',
reader_cfg=livemathbench_reader_cfg,
),
judge_cfg={},
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False
),
),
n=1, # repeat n times
) for split in splits
]

View File

@ -4,7 +4,6 @@ from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import LiveReasonBenchDataset, livereasonbench_postprocess
from opencompass.utils import xml_tag_postprocessor
GRADER_TEMPLATE = """
@ -97,7 +96,7 @@ livereasonbench_infer_cfg = dict(
],
)),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=16384))
inferencer=dict(type=GenInferencer))
livereasonbench_eval_cfg = dict(
evaluator=dict(
@ -122,23 +121,22 @@ livereasonbench_eval_cfg = dict(
type=LiveReasonBenchDataset,
path='opencompass/LiveReasonBench',
reader_cfg=livereasonbench_reader_cfg,
version='livereasonbench-20250428',
),
judge_cfg=dict(),
dict_postprocessor=dict(type=livereasonbench_postprocess),
pred_postprocessor=dict(type=xml_tag_postprocessor, tag='<conclude>'),
),
pred_role='BOT',
)
livereasonbench_datasets = [
dict(
abbr='LiveReasonBench-20241202',
abbr='LiveReasonBench-20250428',
type=LiveReasonBenchDataset,
path='opencompass/LiveReasonBench',
reader_cfg=livereasonbench_reader_cfg,
infer_cfg=livereasonbench_infer_cfg,
eval_cfg=livereasonbench_eval_cfg,
version='livereasonbench-20241202',
mode='singlescore',
version='livereasonbench-20250428',
n=1
)
]

View File

@ -0,0 +1,117 @@
"""
Summary: A config for MATH-500 (prm800k_500) Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- MATHVerifyEvaluator
- GenericLLMEvaluator
Available Models:
- Instruct/Chat Models
"""
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets import MATHDataset
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator
)
# ----------------------------- Detailed Config -----------------------------
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')
math_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{problem}\nRemember to put your final answer within \\boxed{}.'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{solution}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
cascade_evaluator = dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=MATHDataset,
path='opencompass/math',
file_name = 'test_prm800k_500.json',
reader_cfg=math_reader_cfg,
n=4,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
parallel=False,
)
math_datasets = [
dict(
type=MATHDataset,
abbr=f'math_prm800k_500',
path='opencompass/math',
file_name = 'test_prm800k_500.json',
reader_cfg=math_reader_cfg,
infer_cfg=math_infer_cfg,
eval_cfg=dict(
evaluator=cascade_evaluator,
),
n=1,
)
]

View File

@ -2,7 +2,7 @@ from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CustomDataset
from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
from opencompass.evaluator import MATHVerifyEvaluator
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')
@ -24,7 +24,7 @@ math_infer_cfg = dict(
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator),
evaluator=dict(type=MATHVerifyEvaluator),
)
math_datasets = [

View File

@ -2,7 +2,7 @@ from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MATHDataset
from opencompass.openicl.icl_evaluator import MATHEvaluator
from opencompass.evaluator import MATHVerifyEvaluator
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')
@ -24,7 +24,7 @@ math_infer_cfg = dict(
inferencer=dict(type=GenInferencer))
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator)
evaluator=dict(type=MATHVerifyEvaluator)
)
math_datasets = [

View File

@ -1,7 +1,7 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import MATHEvaluator
from opencompass.evaluator import MATHVerifyEvaluator
from opencompass.datasets import (
MATHDataset,
math_postprocess_v2,
@ -28,7 +28,7 @@ math_infer_cfg = dict(
# postprocess v2
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator)
evaluator=dict(type=MATHVerifyEvaluator)
)
math_datasets = [

View File

@ -0,0 +1,127 @@
"""
Setting: 0-shot No-CoT
Evaluator: CascadeEvaluator (AccEvaluator + GenericLLMEvaluator)
"""
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import MMLUDataset
from opencompass.utils.text_postprocessors import match_answer_pattern
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
)
with read_base():
# from .....configs.datasets.mmlu.mmlu_all_sets import mmlu_all_sets
from .mmlu_stem_sets import mmlu_all_sets
# None of the MMLU datasets on Hugging Face are parsed correctly, so we use our own dataset reader
# Please download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar
QUERY_TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of ABCD.
{input}
A) {A}
B) {B}
C) {C}
D) {D}
""".strip()
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: {input}\n A) {A}\n B) {B}\n C) {C}\n D) {D}\n<Original Question End>\n\n
<Gold Target Begin>: \n{target}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
mmlu_reader_cfg = dict(
input_columns=['input', 'A', 'B', 'C', 'D'],
output_column='target',
train_split='dev')
mmlu_datasets = []
for name in mmlu_all_sets:
mmlu_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt=QUERY_TEMPLATE),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
mmlu_eval_cfg = dict(
evaluator=dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=AccEvaluator,
pred_postprocessor=dict(type=match_answer_pattern, answer_pattern=r'(?i)ANSWER\s*:\s*([A-D])'),
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt=GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
abbr=f'lukaemon_mmlu_{name}',
type=MMLUDataset,
path='opencompass/mmlu',
name=name,
reader_cfg=mmlu_reader_cfg,
),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
judge_cfg=dict(),
),
parallel=False
),
)
mmlu_datasets.append(
dict(
abbr=f'lukaemon_mmlu_{name}',
type=MMLUDataset,
path='opencompass/mmlu',
name=name,
reader_cfg=mmlu_reader_cfg,
infer_cfg=mmlu_infer_cfg,
eval_cfg=mmlu_eval_cfg,
mode='singlescore',
))
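For orientation, a minimal sketch of how the cascade-evaluated MMLU sets above could be plugged into a full run; the model entry and both import paths are placeholders rather than paths introduced by this commit, and the judge model is picked up from the environment because judge_cfg is left empty.

from mmengine.config import read_base

with read_base():
    # Placeholder paths: point these at the cascade config above and at any
    # chat model config available in your installation.
    from opencompass.configs.datasets.mmlu.mmlu_cascade_llmjudge_gen import mmlu_datasets
    from opencompass.configs.models.hf_internlm.hf_internlm2_5_1_8b_chat import models

datasets = mmlu_datasets
# With judge_cfg left empty, GenericLLMEvaluator falls back to the
# OC_JUDGE_MODEL / OC_JUDGE_API_KEY / OC_JUDGE_API_BASE environment variables.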

View File

@ -1,30 +1,46 @@
"""
Summary: A config for OmniMath Dataset Evaluation.
Setting:
Shot: 0-shot
Evaluator:
- CascadeEvaluator
- MATHVerifyEvaluator
- GenericLLMEvaluator
Repeat: 1
Available Models:
- Instruct/Chat Models
"""
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import Aime2024Dataset, MATHEvaluator, math_postprocess_v2
from opencompass.openicl.icl_evaluator import LMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets.omni_math import OmniMathDataset
from opencompass.evaluator import (
CascadeEvaluator,
GenericLLMEvaluator,
MATHVerifyEvaluator,
)
aime2024_reader_cfg = dict(
input_columns=['question'],
omnimath_reader_cfg = dict(
input_columns=['problem'],
output_column='answer'
)
aime2024_infer_cfg = dict(
omnimath_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{question}\nRemember to put your final answer within \\boxed{}.'),
],
dict(role='HUMAN', prompt='please answer the following mathematical question, put your final answer in \\boxed{}.\n\n{problem}'),
]
)
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=2048)
inferencer=dict(type=GenInferencer)
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
@ -43,16 +59,20 @@ GRADER_TEMPLATE = """
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
aime2024_eval_cfg = dict(
evaluator=dict(
type=LMEvaluator,
cascade_evaluator = dict(
type=CascadeEvaluator,
rule_evaluator=dict(
type=MATHVerifyEvaluator,
),
llm_evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
@ -69,19 +89,27 @@ aime2024_eval_cfg = dict(
),
]),
),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
dataset_cfg=dict(
type=OmniMathDataset,
reader_cfg=omnimath_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
parallel=False,
)
aime2024_datasets = [
omnimath_eval_cfg = dict(
evaluator=cascade_evaluator,
)
omnimath_datasets = [
dict(
abbr='aime2024',
type=Aime2024Dataset,
path='opencompass/aime2024',
reader_cfg=aime2024_reader_cfg,
infer_cfg=aime2024_infer_cfg,
eval_cfg=aime2024_eval_cfg,
mode='singlescore',
type=OmniMathDataset,
abbr='OmniMath',
reader_cfg=omnimath_reader_cfg,
infer_cfg=omnimath_infer_cfg,
eval_cfg=omnimath_eval_cfg,
n=1,
)
]
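Because this change set wires dataset replicas through the evaluator stack, the entry above can request repeated evaluation simply by raising n (and k for pass@k); a hedged variant with illustrative values:

omnimath_repeated_datasets = [
    dict(
        type=OmniMathDataset,
        abbr='OmniMath',
        reader_cfg=omnimath_reader_cfg,
        infer_cfg=omnimath_infer_cfg,
        eval_cfg=omnimath_eval_cfg,
        # Evaluate 4 replicas; BaseEvaluator averages metrics across them and,
        # where per-sample flags such as cascade_correct exist, reports pass@k.
        n=4,
        k=4,
    )
]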

View File

@ -1,18 +1,19 @@
from mmengine.config import read_base
with read_base():
from .groups.agieval import agieval_summary_groups
from .groups.mmlu import mmlu_summary_groups
from .groups.cmmlu import cmmlu_summary_groups
from .groups.ceval import ceval_summary_groups
from .groups.bbh import bbh_summary_groups
from .groups.GaokaoBench import GaokaoBench_summary_groups
from .groups.flores import flores_summary_groups
from .groups.tydiqa import tydiqa_summary_groups
from .groups.xiezhi import xiezhi_summary_groups
from .groups.scibench import scibench_summary_groups
from .groups.mgsm import mgsm_summary_groups
from .groups.longbench import longbench_summary_groups
# with read_base():
# pass
# from .groups.agieval import agieval_summary_groups
# from .groups.mmlu import mmlu_summary_groups
# from .groups.cmmlu import cmmlu_summary_groups
# from .groups.ceval import ceval_summary_groups
# from .groups.bbh import bbh_summary_groups
# from .groups.GaokaoBench import GaokaoBench_summary_groups
# from .groups.flores import flores_summary_groups
# from .groups.tydiqa import tydiqa_summary_groups
# from .groups.xiezhi import xiezhi_summary_groups
# from .groups.scibench import scibench_summary_groups
# from .groups.mgsm import mgsm_summary_groups
# from .groups.longbench import longbench_summary_groups
summarizer = dict(
summary_groups=sum([v for k, v in locals().items() if k.endswith('_summary_groups')], []),

View File

@ -3,6 +3,9 @@ from typing import Dict, List, Optional, Union
from datasets import Dataset, DatasetDict, concatenate_datasets
from opencompass.openicl import DatasetReader
from opencompass.utils import get_logger
logger = get_logger()
class BaseDataset:

View File

@ -173,44 +173,76 @@ class korbenchEvaluator(BaseEvaluator):
def __init__(self):
super().__init__()
def score(self, predictions, references, test_set):
"""Evaluate predictions for a single prompt_mode in KOR-Bench."""
if not test_set:
raise ValueError('Test set is empty.')
def sample_score(self, prediction, reference, test_item=None):
"""Evaluate a single sample.
prompt_mode = test_set[0][
'prompt_mode'] # Determine the prompt_mode from the first entry
data = {}
Args:
prediction: The model's prediction
reference: The reference answer
test_item: Additional information about the test sample
# Organize data for the given prompt_mode
for i in range(len(predictions)):
entry = {
'prediction': predictions[i],
'gold': references[i],
'rule_id': test_set[i].get('rule_id', None),
'category': test_set[i].get('category', None),
'rule_list': test_set[i].get('rule_list', None),
'question_list': test_set[i].get('question_list', None),
'base_path': test_set[i].get('base_path', None),
}
data[i] = entry
Returns:
Dict: A dictionary containing evaluation results
"""
if test_item is None:
raise ValueError('Test item is required.')
if not data:
raise ValueError(f"No data found for prompt_mode '{prompt_mode}'")
prompt_mode = test_item.get('prompt_mode')
# Evaluate based on the prompt_mode
# Build data for a single sample
entry = {
'prediction': prediction,
'gold': reference,
'rule_id': test_item.get('rule_id', None),
'category': test_item.get('category', None),
'rule_list': test_item.get('rule_list', None),
'question_list': test_item.get('question_list', None),
'base_path': test_item.get('base_path', None),
}
# Evaluate the single sample
data = {0: entry}
# Evaluate based on different prompt_mode
if prompt_mode == '0_shot':
evaluation_results = evaluate_responses(data, '0_shot')
elif prompt_mode == '3_shot':
evaluation_results = evaluate_responses(data, '3_shot')
elif prompt_mode in ['Multi-Q', 'Multi-R', 'Multi-RQ', 'mixed']:
evaluation_results = evaluate_responses(data, 'mixed',
test_set[0]['base_path'])
test_item.get('base_path'))
else:
raise ValueError(f'Unsupported prompt_mode: {prompt_mode}')
# Calculate accuracy
correct_count = sum(res['is_correct'] for res in evaluation_results)
accuracy = (correct_count / len(evaluation_results)) * 100
return {
'is_correct': False,
'pred': prediction,
'answer': reference
}
# Return scores
return {'accuracy': accuracy}
# Return evaluation results
result = evaluation_results[0]
result['correct'] = result['is_correct']
result.update({'pred': prediction, 'answer': reference})
return result
def score(self, predictions, references, test_set):
"""Evaluate each sample using sample_score."""
if not test_set:
raise ValueError('Test set is empty.')
details = []
correct_count = 0
# Call sample_score for each sample
for i in range(len(predictions)):
result = self.sample_score(predictions[i], references[i],
test_set[i])
details.append(result)
if result.get('is_correct', False):
correct_count += 1
# Calculate accuracy
accuracy = (correct_count /
len(predictions)) * 100 if predictions else 0
# Return evaluation results
return {'accuracy': accuracy, 'details': details}
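The refactor above splits per-sample logic (sample_score) from aggregation (score), which is what lets CascadeEvaluator call evaluators one sample at a time. A minimal sketch of the same pattern for a custom evaluator, assuming exact-match scoring and that BaseEvaluator is importable as shown:

from opencompass.openicl.icl_evaluator import BaseEvaluator


class ExactMatchEvaluator(BaseEvaluator):

    def sample_score(self, prediction, reference, test_item=None):
        # Score one sample; 'is_correct' mirrors the field used above.
        is_correct = prediction.strip() == reference.strip()
        return {'is_correct': is_correct, 'pred': prediction, 'answer': reference}

    def score(self, predictions, references, test_set=None):
        details = [
            self.sample_score(p, r, test_set[i] if test_set else None)
            for i, (p, r) in enumerate(zip(predictions, references))
        ]
        correct = sum(d['is_correct'] for d in details)
        accuracy = (correct / len(predictions)) * 100 if predictions else 0
        return {'accuracy': accuracy, 'details': details}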

View File

@ -204,7 +204,11 @@ def math_postprocess_v2(text: str) -> str:
@ICL_EVALUATORS.register_module()
class MATHEvaluator(BaseEvaluator):
def __init__(self, version='v1'):
def __init__(self,
version='v1',
pred_postprocessor=None):  # accept pred_postprocessor to forward to the parent __init__
super().__init__(
pred_postprocessor=pred_postprocessor)  # call the parent __init__
assert version in ['v1', 'v2']
self.version = version

View File

@ -280,7 +280,11 @@ class MusrDataset(BaseDataset):
@ICL_EVALUATORS.register_module()
class MusrEvaluator(BaseEvaluator):
def __init__(self, answer_index_modifier=1, self_consistency_n=1):
def __init__(self,
answer_index_modifier=1,
self_consistency_n=1,
pred_postprocessor=None):
super().__init__(pred_postprocessor=pred_postprocessor)
self.answer_index_modifier = answer_index_modifier
self.self_consistency_n = self_consistency_n

View File

@ -76,7 +76,6 @@ class ReviewEvaluator:
pred_data = data_sample.pred
if pred_data is not None:
# import pdb; pdb.set_trace()
metrics_result['review_quality'] = 1.0 if pred_data == \
data_sample.gt else 0.0
metrics_result['parse_rate'] = 1.0

View File

@ -1,2 +1,3 @@
from .cascade_evaluator import CascadeEvaluator # noqa
from .generic_llm_evaluator import GenericLLMEvaluator # noqa
from .math_evaluator import MATHVerifyEvaluator # noqa

View File

@ -34,7 +34,8 @@ class CascadeEvaluator(BaseEvaluator):
sample_score_fn: Optional[Callable] = None,
parallel: bool = True,
) -> None:
self.logger = get_logger()
super().__init__()
self.logger = get_logger(__name__)
# Initialize the LLM evaluator
llm_evaluator_type = llm_evaluator.pop('type')
@ -58,7 +59,10 @@ class CascadeEvaluator(BaseEvaluator):
raise ValueError(
'Either rule_evaluator or sample_score_fn must be provided')
def sample_score(self, prediction: str, reference: str) -> Dict[str, Any]:
def sample_score(self,
prediction: str,
reference: str,
test_set=None) -> Dict[str, Any]:
"""Score a single sample using sample_score_fn or rule_evaluator.
Args:
@ -70,7 +74,7 @@ class CascadeEvaluator(BaseEvaluator):
"""
if self.sample_score_fn:
# Use user-provided function to evaluate a single sample
result = self.sample_score_fn(prediction, reference)
result = self.sample_score_fn(prediction, reference, test_set)
if not isinstance(result, dict):
# Ensure result is a dictionary with at least 'correct' field
result = {
@ -82,7 +86,8 @@ class CascadeEvaluator(BaseEvaluator):
else:
# Use rule_evaluator to evaluate a single sample by calling
# the score method with single-element lists
result = self.rule_evaluator.score([prediction], [reference])
result = self.rule_evaluator.score([prediction], [reference],
[test_set])
if 'details' in result and len(result['details']) > 0:
return result['details'][0]
else:
@ -137,7 +142,14 @@ class CascadeEvaluator(BaseEvaluator):
failed_indices = []
for i, (pred, ref) in enumerate(zip(predictions, references)):
result = self.sample_score(pred, ref)
if test_set is not None:
test_item = test_set[i]
else:
test_item = None
# Apply prediction postprocessing for each sample
[pred] = self.rule_evaluator.pred_postprocess([pred])
result = self.sample_score(pred, ref, test_item)
result['evaluation_method'] = 'rule'
details.append({'rule_evaluation': result})
@ -181,8 +193,11 @@ class CascadeEvaluator(BaseEvaluator):
original_out_dir = getattr(self.llm_evaluator, '_out_dir', None)
self.llm_evaluator._out_dir = f'{self._out_dir}_llm_judge'
# Append the dataset replica index to the LLM results path
llm_results_path = f'{self.llm_evaluator._out_dir}_replica{self.dataset_replica_idx}.json' # noqa
self.logger.info(f'LLM evaluation results will be saved at '
f'{llm_results_path}')
# Check if results already exist to avoid re-evaluation
llm_results_path = f'{self.llm_evaluator._out_dir}.json'
if os.path.exists(llm_results_path):
self.logger.info(
f'Loading existing LLM evaluation results from '
@ -212,7 +227,15 @@ class CascadeEvaluator(BaseEvaluator):
# Use GenericLLMEvaluator to evaluate samples
# unset dataset_cfg for GenericLLMEvaluator to
# directly use test_set
# self.llm_evaluator.output_path = llm_results_path
self.llm_evaluator._dataset_replica_idx = \
self._dataset_replica_idx
self.llm_evaluator.dataset_cfg = None
# Apply prediction postprocessing for the LLM evaluator
failed_predictions = self.llm_evaluator.pred_postprocess(
failed_predictions)
llm_results = self.llm_evaluator.score(
predictions=failed_predictions,
references=failed_references,
@ -235,6 +258,9 @@ class CascadeEvaluator(BaseEvaluator):
# Update the details for samples that were evaluated by LLM
for i, llm_detail in enumerate(llm_details.values()):
# Add dataset replica index to LLM evaluation result
llm_detail['dataset_replica_idx'] = self.dataset_replica_idx
original_index = failed_indices[i]
# Store original rule-based evaluation result
rule_result = details[original_index].copy()
@ -283,6 +309,16 @@ class CascadeEvaluator(BaseEvaluator):
f'LLM evaluation: {llm_correct}/{llm_evaluated} '
f'correct ({llm_accuracy:.2f}%)')
# Append cascade correctness flag to each sample
for item in details:
_rule_correct = item['rule_evaluation'].get('correct', False)
if 'llm_evaluation' in item:
_llm_correct = item['llm_evaluation'].get(
'llm_correct', False)
else:
_llm_correct = False
item['cascade_correct'] = _rule_correct or _llm_correct
result = {
'accuracy': final_accuracy,
'cascade_stats': {
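Besides a rule_evaluator, CascadeEvaluator accepts a plain callable through sample_score_fn (see the constructor above). A hedged sketch, where the callable and llm_judge_cfg stand in for your own rule and for a GenericLLMEvaluator config like the ones earlier in this diff:

def substring_match(prediction, reference, test_item=None):
    # Toy rule: mark the sample correct only if the gold answer
    # appears verbatim in the model output.
    correct = reference.strip() in prediction
    return {'correct': correct, 'pred': prediction, 'answer': reference}


cascade_evaluator = dict(
    type=CascadeEvaluator,
    sample_score_fn=substring_match,
    llm_evaluator=llm_judge_cfg,  # assumed: a GenericLLMEvaluator config dict
    # With parallel=False (as in the configs above), only samples the rule
    # step marks wrong are forwarded to the LLM judge.
    parallel=False,
)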

View File

@ -1,5 +1,6 @@
import os
import os.path as osp
from copy import deepcopy
from typing import Dict, List, Optional
import mmengine
@ -14,6 +15,8 @@ from opencompass.registry import (DICT_POSTPROCESSORS, ICL_PROMPT_TEMPLATES,
from opencompass.utils import build_dataset_from_cfg, build_model_from_cfg
from opencompass.utils.logging import get_logger
logger = get_logger(__name__)
class GenericLLMEvaluator(BaseEvaluator):
"""Generic LLM evaluator.
@ -23,6 +26,7 @@ class GenericLLMEvaluator(BaseEvaluator):
judge_cfg (ConfigDict): The config for Judge LLM.
dataset_cfg (ConfigDict): The config for dataset.
pred_postprocessor (ConfigDict): The config for postprocessor.
used for the prediction results.
dict_postprocessor (ConfigDict): The config for postprocessor,
used for evaluation results dict.
"""
@ -36,8 +40,7 @@ class GenericLLMEvaluator(BaseEvaluator):
dict_postprocessor: Optional[ConfigDict] = None,
keep_predictions: bool = False,
) -> None:
self.logger = get_logger()
super().__init__(pred_postprocessor=pred_postprocessor)
# If judge_cfg is not provided, fall back to the default configuration
if not judge_cfg:
self.judge_cfg = self.default_judge_cfg
@ -54,14 +57,14 @@ class GenericLLMEvaluator(BaseEvaluator):
self.dict_postprocessor = dict_postprocessor
self.pred_postprocessor = pred_postprocessor
def build_inferencer(self, ):
def build_inferencer(self):
"""Build LLM Inference."""
output_path = self._out_dir
self.output_path = f'{output_path}.json'
out_dir, out_name = osp.split(output_path)
out_name = f'{out_name}.json'
self.logger.info(
self.output_path = f'{self._out_dir}_replica{self.dataset_replica_idx}.json' # noqa
logger.info(f'LLM judge details will be saved at: {self.output_path}')
out_dir, out_name = osp.split(self.output_path)
logger.info(
f'Set self.output_path to {self.output_path} for current task')
assert self.output_path is not None, 'output_path is None'
@ -98,7 +101,6 @@ class GenericLLMEvaluator(BaseEvaluator):
# -------------- Build Inferencer ----------------
self.build_inferencer()
# ---------------- Process Predictions ------------------
predictions = self.pred_postprocess(predictions)
@ -178,7 +180,7 @@ class GenericLLMEvaluator(BaseEvaluator):
if self.dict_postprocessor is None:
return output
else:
kwargs = self.dict_postprocessor
kwargs = deepcopy(self.dict_postprocessor)
proc = DICT_POSTPROCESSORS.get(kwargs.pop('type'))
sig = inspect.signature(proc)
if 'dataset' in sig.parameters:
@ -192,7 +194,8 @@ class GenericLLMEvaluator(BaseEvaluator):
@property
def default_judge_cfg(self):
from opencompass.models import OpenAISDK
logger.info('Please set your judge model via the `OC_JUDGE_MODEL`, \
`OC_JUDGE_API_KEY`, and `OC_JUDGE_API_BASE` environment variables.')
DEFAULT_JUDGE_CFG = dict(
type=OpenAISDK,
path=os.environ['OC_JUDGE_MODEL'],
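When judge_cfg is empty, default_judge_cfg above builds an OpenAISDK judge from environment variables; a short, hedged setup sketch (model name, key, and URL are placeholders):

import os

os.environ['OC_JUDGE_MODEL'] = 'Qwen2.5-72B-Instruct'         # placeholder judge model
os.environ['OC_JUDGE_API_KEY'] = 'sk-placeholder'             # placeholder API key
os.environ['OC_JUDGE_API_BASE'] = 'http://localhost:8000/v1'  # placeholder endpoint

# Alternatively, pass an explicit judge_cfg (an OpenAISDK-style model dict)
# to GenericLLMEvaluator and skip the environment fallback entirely.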

View File

@ -3,9 +3,9 @@ from opencompass.registry import ICL_EVALUATORS
@ICL_EVALUATORS.register_module()
class MATHEvaluator(BaseEvaluator):
class MATHVerifyEvaluator(BaseEvaluator):
def score(self, predictions, references):
def score(self, predictions, references, test_set=None):
try:
from latex2sympy2_extended import NormalizationConfig
from math_verify import (ExprExtractionConfig,
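After the rename, the evaluator is exposed from opencompass.evaluator (see the package __init__ earlier in this diff); a minimal, hedged eval_cfg sketch using it on its own, outside a cascade:

from opencompass.evaluator import MATHVerifyEvaluator

math_eval_cfg = dict(
    # Used directly here; in the OmniMath config above it also serves as the
    # rule_evaluator stage of a CascadeEvaluator.
    evaluator=dict(type=MATHVerifyEvaluator),
)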

View File

@ -556,28 +556,27 @@ class OpenAI(BaseAPIModel):
class OpenAISDK(OpenAI):
def __init__(
self,
path: str = 'gpt-3.5-turbo',
max_seq_len: int = 16384,
query_per_second: int = 1,
rpm_verbose: bool = False,
retry: int = 2,
key: str | List[str] = 'ENV',
org: str | List[str] | None = None,
meta_template: Dict | None = None,
openai_api_base: str | List[str] = OPENAISDK_API_BASE,
openai_proxy_url: Optional[str] = None,
mode: str = 'none',
logprobs: bool | None = False,
top_logprobs: int | None = None,
temperature: float | None = None,
tokenizer_path: str | None = None,
extra_body: Dict | None = None,
verbose: bool = False,
status_code_mappings: dict = {},
think_tag: str = '</think>',
):
def __init__(self,
path: str = 'gpt-3.5-turbo',
max_seq_len: int = 16384,
query_per_second: int = 1,
rpm_verbose: bool = False,
retry: int = 2,
key: str | List[str] = 'ENV',
org: str | List[str] | None = None,
meta_template: Dict | None = None,
openai_api_base: str | List[str] = OPENAISDK_API_BASE,
openai_proxy_url: Optional[str] = None,
mode: str = 'none',
logprobs: bool | None = False,
top_logprobs: int | None = None,
temperature: float | None = None,
tokenizer_path: str | None = None,
extra_body: Dict | None = None,
verbose: bool = False,
http_client_cfg: dict = {},
status_code_mappings: dict = {},
think_tag: str = '</think>'):
super().__init__(
path,
max_seq_len,
@ -605,20 +604,20 @@ class OpenAISDK(OpenAI):
else:
self.openai_api_base = openai_api_base
if self.proxy_url is None:
self.openai_client = OpenAI(base_url=self.openai_api_base,
api_key=key)
else:
proxies = {
'http://': self.proxy_url,
'https://': self.proxy_url,
}
if self.proxy_url or http_client_cfg:
if self.proxy_url:
http_client_cfg['proxies'] = {
'http://': self.proxy_url,
'https://': self.proxy_url,
}
self.openai_client = OpenAI(
base_url=self.openai_api_base,
api_key=key,
http_client=httpx.Client(
**http_client_cfg) if http_client_cfg else None,
)
self.openai_client = OpenAI(
base_url=self.openai_api_base,
api_key=key,
http_client=httpx.Client(proxies=proxies),
)
if self.verbose:
self.logger.info(f'Used openai_client: {self.openai_client}')
self.status_code_mappings = status_code_mappings
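The new http_client_cfg is forwarded verbatim as httpx.Client(**http_client_cfg); a hedged model-config sketch (model name and endpoint are placeholders, and which keys httpx accepts depends on the installed httpx version):

from opencompass.models import OpenAISDK

judge_model_cfg = dict(
    type=OpenAISDK,
    path='gpt-4o-mini',                            # placeholder model name
    key='ENV',                                     # read the API key from the environment
    openai_api_base='https://api.example.com/v1',  # placeholder endpoint
    # Common httpx.Client options; 'proxies' is also filled in automatically
    # when openai_proxy_url is given, as shown above.
    http_client_cfg=dict(verify=False),
)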
@ -679,6 +678,7 @@ class OpenAISDK(OpenAI):
try:
if self.verbose:
self.logger.info('Start calling OpenAI API')
responses = self.openai_client.chat.completions.create(
**query_data, timeout=timeout) # timeout in seconds
if self.verbose:
@ -689,7 +689,6 @@ class OpenAISDK(OpenAI):
self.logger.info(responses)
except Exception:
pass # noqa F841
# Check if response is empty or content is empty
if (not responses.choices or not responses.choices[0].message
or

View File

@ -14,4 +14,3 @@ from .icl_misc_evaluator import AveragePPLEvaluator # noqa
from .icl_plugin_evaluator import TEvalEvaluator # noqa
from .icl_toxic_evaluator import ToxicEvaluator # noqa
from .lm_evaluator import LMEvaluator # noqa
from .math_evaluator import MATHEvaluator # noqa

View File

@ -8,6 +8,11 @@ import numpy as np
from datasets import Dataset
from scipy.stats import hypergeom
from opencompass.registry import TEXT_POSTPROCESSORS
from opencompass.utils.logging import get_logger
logger = get_logger(__name__)
def compute_pass_at_k(n, c, k):
if n - c < k:
@ -39,14 +44,19 @@ def compute_mg_pass_at_k(n, c, k):
class BaseEvaluator:
def __init__(self) -> None:
pass
def __init__(self, pred_postprocessor=None) -> None:
self.pred_postprocessor = pred_postprocessor
self._dataset_replica_idx = 0 # Default value for dataset_replica_idx
@property
def output_dir(self):
# please see opencompass/opencompass/tasks/openicl_eval.py Line 197-200
return self._out_dir
@property
def dataset_replica_idx(self):
return self._dataset_replica_idx
def group(self, n: int, details: List[Dict[str, Any]],
test_set: Dataset) -> Dict[str, Any]:
example2replications = {}
@ -82,6 +92,15 @@ class BaseEvaluator:
[detail[metric] for detail in details])
return g_passk_details
def pred_postprocess(self, predictions: List) -> Dict:
if not hasattr(
self, 'pred_postprocessor') or self.pred_postprocessor is None:
return predictions
else:
kwargs = deepcopy(self.pred_postprocessor)
proc = TEXT_POSTPROCESSORS.get(kwargs.pop('type'))
return [proc(pred, **kwargs) for pred in predictions]
def evaluate(
self,
k: Union[int, List[int]],
@ -98,10 +117,14 @@ class BaseEvaluator:
raise ValueError(
'Predictions and references must have the same length')
real_size = len(original_dataset) // n
real_size = len(original_dataset) // n # dataset size of each replica
all_details = []
all_results = []
# Run evaluation for each replica
for i in range(n):
self._dataset_replica_idx = i
logger.info(f'Running {i}-th replica of evaluation')
def select_fn(i, real_size, x):
if isinstance(x, Dataset):
@ -111,11 +134,14 @@ class BaseEvaluator:
else:
return x
results = self.score(
**{
key: select_fn(i, real_size, value)
for key, value in score_kwargs.items()
})
current_params = {
key: select_fn(i, real_size, value)
for key, value in score_kwargs.items()
}
current_params['predictions'] = self.pred_postprocess(
current_params['predictions'])
results = self.score(**current_params)
details = results.pop('details', None)
if details is not None:
if isinstance(details, Dict):
@ -124,11 +150,11 @@ class BaseEvaluator:
all_results.append(results)
eval_results = {}
for single_results in all_results:
for key in single_results:
for single_replica_results in all_results:
for key in single_replica_results:
if key not in eval_results:
eval_results[key] = []
eval_results[key].append(single_results[key])
eval_results[key].append(single_replica_results[key])
for key in deepcopy(eval_results):
if isinstance(eval_results[key][0], float) or isinstance(
eval_results[key][0], int):
@ -138,9 +164,8 @@ class BaseEvaluator:
eval_results.pop(key)
else:
eval_results[key] = np.mean(eval_results[key])
else:
eval_results[key] = eval_results[key][0]
# Calculate the additional metrics
grouped_examples = self.group(n, all_details, original_dataset)
can_calculate = False
if len(all_details) != 0:
@ -158,6 +183,10 @@ class BaseEvaluator:
elif example['detail'].get('is_correct', None) is not None:
can_calculate = True
c += int(example['detail']['is_correct'])
elif example['detail'].get('cascade_correct',
None) is not None:
can_calculate = True
c += int(example['detail']['cascade_correct'])
k_list = [k] if isinstance(k, int) else k
if can_calculate and n > 1 and max(k_list) > 1:
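For context, the pass@k bookkeeping above keys off per-sample flags (is_correct, correct, or the new cascade_correct) and uses scipy's hypergeometric distribution internally; the standard estimator it corresponds to is sketched below for illustration:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k draws from n replicas is correct,
    given c of the n replicas were correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 replicas, 2 correct -> pass@1 = 0.5, pass@2 = 5/6
print(pass_at_k(4, 2, 1), pass_at_k(4, 2, 2))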

View File

@ -1,10 +1,11 @@
import os
import random
from typing import List
from typing import List, Optional
import evaluate
import numpy as np
from datasets import Dataset
from mmengine.config import ConfigDict
from opencompass.registry import ICL_EVALUATORS
@ -19,12 +20,17 @@ class HuggingfaceEvaluator(BaseEvaluator):
seed (int): There exists some randomness during the calculation of some
metrics, thus we set a fixed random seed for reproducing. Defaults
to 0.
pred_postprocessor (optional): Function or configuration for prediction
post-processing.
"""
def __init__(self, metric: str, seed: int = 0) -> None:
def __init__(self,
metric: str,
seed: int = 0,
pred_postprocessor=None) -> None:
self.metric = metric
self.seed = seed
super().__init__()
super().__init__(pred_postprocessor=pred_postprocessor)
def _preprocess(self, predictions: List, references: List) -> dict:
"""Preprocess the final predictions and references to needed format.
@ -52,7 +58,10 @@ class HuggingfaceEvaluator(BaseEvaluator):
"""
return scores
def score(self, predictions: List, references: List) -> dict:
def score(self,
predictions: List,
references: List,
test_set=None) -> dict:
"""Calculate scores.
Args:
@ -92,10 +101,15 @@ class HuggingfaceEvaluator(BaseEvaluator):
class AccEvaluator(HuggingfaceEvaluator):
"""Accuracy evaluator."""
def __init__(self) -> None:
super().__init__(metric='accuracy')
def __init__(self,
pred_postprocessor: Optional[ConfigDict] = None) -> None:
super().__init__(metric='accuracy',
pred_postprocessor=pred_postprocessor)
def _preprocess(self, predictions: List, references: List) -> dict:
def _preprocess(self,
predictions: List,
references: List,
test_set=None) -> dict:
"""Preprocess the final predictions and references to needed format.
Args:
@ -187,8 +201,9 @@ class RougeEvaluator(HuggingfaceEvaluator):
Note: this evaluator is not suitable for chinese datasets.
"""
def __init__(self) -> None:
super().__init__(metric='rouge')
def __init__(self,
pred_postprocessor: Optional[ConfigDict] = None) -> None:
super().__init__(metric='rouge', pred_postprocessor=pred_postprocessor)
def _postprocess(self, scores: dict) -> dict:
"""Postprocess for final scores.
@ -206,8 +221,10 @@ class RougeEvaluator(HuggingfaceEvaluator):
class BleuEvaluator(HuggingfaceEvaluator):
"""Bleu evaluator."""
def __init__(self) -> None:
super().__init__(metric='sacrebleu')
def __init__(self,
pred_postprocessor: Optional[ConfigDict] = None) -> None:
super().__init__(metric='sacrebleu',
pred_postprocessor=pred_postprocessor)
class BleuFloresEvaluator(HuggingfaceEvaluator):

View File

@ -26,6 +26,7 @@ class NumWorkerPartitioner(BasePartitioner):
dataset_size_path (str): The path to the dataset size cache file.
keep_keys (list[str]): The keys to be kept from the experiment config
to the task config.
force_rebuild (bool): Whether to force rebuilding the dataset to get its size.
"""
def __init__(self,
@ -35,7 +36,8 @@ class NumWorkerPartitioner(BasePartitioner):
min_task_size: int = 16,
strategy: str = 'heuristic',
dataset_size_path: str = '.cache/dataset_size.json',
keep_keys: Optional[List[str]] = None):
keep_keys: Optional[List[str]] = None,
force_rebuild: bool = False):
super().__init__(out_dir=out_dir, keep_keys=keep_keys)
if strategy == 'split' and num_worker is not None:
self.logger.warning('num_worker is ignored with split.')
@ -44,6 +46,7 @@ class NumWorkerPartitioner(BasePartitioner):
self.num_split = num_split or num_worker
self.min_task_size = min_task_size
self.dataset_size_path = dataset_size_path
self.force_rebuild = force_rebuild
assert strategy in ('heuristic', 'split'), \
f'Unsupported partition strategy: {strategy}. '\
'Supported strategies are: `heuristic`, `split` .'
@ -106,7 +109,7 @@ class NumWorkerPartitioner(BasePartitioner):
@property
def dataset_size(self):
if not hasattr(self, '_dataset_size'):
if osp.exists(self.dataset_size_path):
if not self.force_rebuild and osp.exists(self.dataset_size_path):
self._dataset_size = mmengine.load(self.dataset_size_path)
else:
self._dataset_size = {}
@ -130,22 +133,25 @@ class NumWorkerPartitioner(BasePartitioner):
def get_size(self, dataset: ConfigDict) -> int:
dataset_abbr = dataset_abbr_from_cfg(dataset)
test_range = dataset.reader_cfg.get('test_range', '')
if dataset_abbr in self.dataset_size:
# If not forcing rebuild and data exists in cache, use the cache
if not self.force_rebuild and dataset_abbr in self.dataset_size:
actual_size = eval('len(range(self.dataset_size[dataset_abbr])'
f'{test_range})')
return actual_size
# Otherwise, rebuild the dataset to get its size
dataset = build_dataset_from_cfg(dataset)
self.dataset_size[dataset_abbr] = len(dataset.test)
mmengine.mkdir_or_exist('.cache/')
mmengine.dump(self.dataset_size,
self.dataset_size_path,
indent=4,
ensure_ascii=False)
# Save to cache file
if self.dataset_size_path:
mmengine.mkdir_or_exist('.cache/')
mmengine.dump(self.dataset_size,
self.dataset_size_path,
indent=4,
ensure_ascii=False)
actual_size = eval('len(range(self.dataset_size[dataset_abbr])'
f'{test_range})')
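A hedged sketch of an infer section that exercises the new force_rebuild flag; the runner and task entries are the usual OpenCompass defaults and act as placeholders here:

from opencompass.partitioners import NumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

infer = dict(
    partitioner=dict(
        type=NumWorkerPartitioner,
        num_worker=8,
        # Skip .cache/dataset_size.json and re-measure every dataset.
        force_rebuild=True,
    ),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)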

View File

@ -146,11 +146,16 @@ class OpenICLEvalTask(BaseTask):
preds = []
i = 1
while osp.exists(osp.realpath(filename)):
sub_preds = mmengine.load(filename)
preds.extend(
[sub_preds[str(i)] for i in range(len(sub_preds))])
filename = root + f'_{i}' + ext
i += 1
try:
sub_preds = mmengine.load(filename)
preds.extend(
[sub_preds[str(i)] for i in range(len(sub_preds))])
filename = root + f'_{i}' + ext
i += 1
except Exception as e:
self.logger.error(
f'Error loading prediction file {filename}: {e}')
break
pred_dicts = copy.deepcopy(preds)
preds = {k: [pred.get(k) for pred in preds] for k in preds[0]}

View File

@ -2,6 +2,8 @@ import logging
import os
from mmengine.logging import MMLogger
from rich.console import Console
from rich.syntax import Syntax
_nameToLevel = {
'CRITICAL': logging.CRITICAL,
@ -79,3 +81,14 @@ class FilterDuplicateMessage(logging.Filter):
self.seen.add(record.msg)
return True
return False
def pretty_print_config(cfg):
"""Pretty print config using the rich library."""
console = Console()
config_str = cfg.pretty_text
syntax = Syntax(config_str,
'python',
theme='solarized-dark',
line_numbers=True)
console.print(syntax)
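A short usage sketch for the new helper, assuming it is exported from opencompass.utils.logging and given an mmengine Config (the config path is a placeholder):

from mmengine.config import Config
from opencompass.utils.logging import pretty_print_config  # assumed export location

cfg = Config.fromfile('examples/eval_demo.py')  # placeholder config file
pretty_print_config(cfg)  # renders cfg.pretty_text with rich syntax highlighting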

View File

@ -150,6 +150,13 @@ def get_config_from_arg(args) -> Config:
dataset['meta_path'] = args.custom_dataset_meta_path
dataset = make_custom_dataset_config(dataset)
datasets.append(dataset)
# Apply the requested number of dataset repeat runs (replicas)
if len(datasets) > 0 and args.dataset_num_runs > 1:
logger.warning(f'--dataset-num-runs is set; datasets will be evaluated with {args.dataset_num_runs} runs.')
for _dataset in datasets:
logger.warning(f"The default num runs of {_dataset['abbr']} is: {_dataset['n']}, changed into: {args.dataset_num_runs}")
_dataset['n'] = args.dataset_num_runs
_dataset['k'] = args.dataset_num_runs
# parse model args
if not args.models and not args.hf_path:
@ -204,7 +211,6 @@ def get_config_from_arg(args) -> Config:
summarizers_dir = [
os.path.join(args.config_dir, 'summarizers'),
os.path.join(default_configs_dir, './summarizers'),
]
# Check if summarizer_arg contains '/'
@ -308,7 +314,7 @@ def change_accelerator(models, accelerator):
model_kwargs=model_kwargs,
max_seq_len=model.get('max_seq_len', None),
max_out_len=model['max_out_len'],
batch_size=16,
batch_size=model.get('batch_size', 16),
run_cfg=model['run_cfg'],
stop_words=model.get('stop_words', []),
)
@ -335,7 +341,7 @@ def change_accelerator(models, accelerator):
gen_config=gen_config,
max_seq_len=model.get('max_seq_len', None),
max_out_len=model['max_out_len'],
batch_size=16,
batch_size=model.get('batch_size', 16),
run_cfg=model['run_cfg'],
stop_words=model.get('stop_words', []),
)
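The --dataset-num-runs handling added earlier in this file is equivalent to setting n and k on each dataset entry by hand; a hedged sketch of both routes (the CLI line is shown as a comment, and the dataset/model names are placeholders):

# CLI route:
#   opencompass --datasets <dataset_config> --models <model_config> --dataset-num-runs 4

# Config route: set the replica count directly on each dataset dict.
for _dataset in datasets:      # `datasets` as assembled in get_config_from_arg
    _dataset['n'] = 4          # evaluate 4 replicas of every dataset
    _dataset['k'] = 4          # report pass@k for k up to 4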