[Fix] Fix Math Evaluation with Judge Model Evaluator & Add README (#1103)

* Add Math Evaluation with Judge Model Evaluator

* Fix Llama-3 meta template

* Fix MATH with JudgeLM Evaluation

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
liushz 2024-04-28 21:58:58 +08:00 committed by GitHub
parent 0b7de67c4a
commit a6f67e1a65
6 changed files with 454 additions and 18 deletions


@@ -0,0 +1,37 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MATHDataset, MATHEvaluator, math_postprocess

QUERY_TEMPLATE = """
Solve the following math problem step by step. The last line of your response should be of the form ANSWER: $ANSWER (without quotes) where $ANSWER is the answer to the problem.

{problem}

Remember to put your answer on its own line after "ANSWER:", and you do not need to use a \\boxed command.
""".strip()

math_reader_cfg = dict(input_columns=['problem'], output_column='solution')

math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(round=[
            dict(role="HUMAN", prompt=QUERY_TEMPLATE),
        ])),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=512))

math_eval_cfg = dict(
    evaluator=dict(type=MATHEvaluator), pred_postprocessor=dict(type=math_postprocess))

math_datasets = [
    dict(
        type=MATHDataset,
        abbr='math',
        path='./data/math/math.json',
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg)
]


@@ -19,7 +19,7 @@ math_infer_cfg = dict(
            dict(role="HUMAN", prompt=QUERY_TEMPLATE),
        ])),
    retriever=dict(type=ZeroRetriever),
-    inferencer=dict(type=GenInferencer, max_out_len=512))
+    inferencer=dict(type=GenInferencer, max_out_len=1024))
math_eval_cfg = dict(
    evaluator=dict(type=MATHEvaluator), pred_postprocessor=dict(type=math_postprocess))


@@ -2,10 +2,8 @@
from mmengine.config import read_base
with read_base():
    from .models.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model # noqa: F401, F403
-    from .models.hf_internlm.hf_internlm2_chat_20b import models as hf_internlm2_chat_20b_model # noqa: F401, F403
    from .models.hf_llama.hf_llama3_70b_instruct import models as hf_llama3_70b_instruct_model # noqa: F401, F403
    from .datasets.math.math_llm_judge import math_datasets # noqa: F401, F403
-from opencompass.models.openai_api import OpenAIAllesAPIN
from opencompass.datasets import math_judement_preprocess
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
@@ -22,41 +20,69 @@ from opencompass.openicl.icl_prompt_template import PromptTemplate
# -------------Prompt Settings ----------------------------------------
eng_obj_prompt = """
Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
-Result: [[Correct]]
+[Yes]
Expression 1: 3/2
Expression 2: 1.5
-Result: [[Correct]]
+[Yes]
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
-Result: [[Incorrect]]
+[No]
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
-Result: [[Correct]]
+[Yes]
Expression 1: 3245/5
Expression 2: 649
-Result: [[Incorrect]]
+[No]
(these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)
Expression 1: 2/(-3)
Expression 2: -2/3
-Result: [[Correct]]
+[Yes]
(trivial simplifications are allowed)
Expression 1: 72 degrees
Expression 2: 72
-Result: [[Correct]]
+[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2: 64 square feet
-Result: [[Correct]]
+[Yes]
(give benefit of the doubt to units)
+Expression 1: 64
+Expression 2:
+[No]
+(only mark as equivalent if both expressions are nonempty)
---
YOUR TASK
-Respond with only "Result: [[Correct]]" or "Result: [[Incorrect]]" (without quotes). Do not include a rationale.
+Respond with only "[Yes]" or "[No]" (without quotes). Do not include a rationale.
Expression 1: {obj_gold}
-Expression 2: {prediction}
-""".strip()
+Expression 2: {prediction}
+"""
# -------------Inference Stage ----------------------------------------
# eval models


@@ -0,0 +1,186 @@
# Using Large Models as JudgeLLM for Objective Evaluation

## Introduction

Traditional objective evaluation relies on comparing model outputs against standard answers. In practice, however, model predictions vary with the model's instruction-following ability, and post-processing functions are imperfect, so the answer may be extracted incorrectly before it is compared with the reference, making the evaluation result inaccurate. To address this, we follow a process similar to subjective evaluation and introduce a JudgeLLM after prediction to assess whether the model's response is consistent with the standard answer ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).

Currently, all models supported by the OpenCompass repository can be used directly as the JudgeLLM; support for dedicated JudgeLLMs is also planned.
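
To make the judging step concrete, here is a minimal, self-contained sketch of the round trip performed for each sample. It is illustrative only: `call_judge` is a stand-in for however you query the judge model (inside OpenCompass this round trip is handled by `LMEvaluator`), and the prompt is abridged from the template shown later in this document.

```python
import re


def call_judge(prompt: str) -> str:
    """Stand-in for an actual JudgeLLM call; a well-behaved judge returns [Yes] or [No]."""
    return '[Yes]'


JUDGE_TEMPLATE = (
    'Look at the following two expressions (answers to a math problem) and judge '
    'whether they are equivalent. Only perform trivial simplifications.\n'
    'Respond with only "[Yes]" or "[No]" (without quotes).\n'
    'Expression 1: {obj_gold}\n'
    'Expression 2: {prediction}\n'
)


def judge_one(gold: str, prediction: str) -> int:
    """Fill the judge prompt, query the judge, and map [Yes]/[No] to 1/0."""
    reply = call_judge(JUDGE_TEMPLATE.format(obj_gold=gold, prediction=prediction))
    match = re.search(r'(?i)\[(yes|no)\]', reply)
    return 1 if match and match.group(1).lower() == 'yes' else 0


print(judge_one('3/2', '1.5'))  # -> 1
```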

## Currently Supported Objective Evaluation Datasets

1. MATH ([https://github.com/hendrycks/math](https://github.com/hendrycks/math))

## Custom JudgeLLM Objective Dataset Evaluation

OpenCompass currently supports most datasets that use `GenInferencer` for inference. The specific process for a custom JudgeLLM objective evaluation is:

1. Build an evaluation configuration, using an API model or an open-source model to infer answers to the questions.
2. Use the selected evaluation model (JudgeLLM) to judge the model's outputs.

### Step One: Build the Evaluation Configuration (Using MATH as an Example)

Below is the config for evaluating the MATH dataset with a JudgeLLM, where the evaluated model is *Llama3-8b-instruct* and the JudgeLLM is *Llama3-70b-instruct*. For more detailed config settings, please refer to `configs/eval_math_llm_judge.py`. The abridged comments below are intended to help users understand the meaning of the configuration file.
```python
# Most of the code in this file is copied from https://github.com/openai/simple-evals/blob/main/math_eval.py
from mmengine.config import read_base
with read_base():
    from .models.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model # noqa: F401, F403
    from .models.hf_llama.hf_llama3_70b_instruct import models as hf_llama3_70b_instruct_model # noqa: F401, F403
    from .datasets.math.math_llm_judge import math_datasets # noqa: F401, F403
from opencompass.datasets import math_judement_preprocess
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import AllObjSummarizer
from opencompass.openicl.icl_evaluator import LMEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
# ------------- Prompt Settings ----------------------------------------
# Evaluation template. Modify the template as needed; the JudgeLLM is expected to answer with [Yes] or [No]. For the MATH dataset, the evaluation template is as follows:
eng_obj_prompt = """
Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
[Yes]
Expression 1: 3/2
Expression 2: 1.5
[Yes]
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
[No]
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
[Yes]
Expression 1: 3245/5
Expression 2: 649
[No]
(these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)
Expression 1: 2/(-3)
Expression 2: -2/3
[Yes]
(trivial simplifications are allowed)
Expression 1: 72 degrees
Expression 2: 72
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2: 64 square feet
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2:
[No]
(only mark as equivalent if both expressions are nonempty)
---
YOUR TASK
Respond with only "[Yes]" or "[No]" (without quotes). Do not include a rationale.
Expression 1: {obj_gold}
Expression 2: {prediction}
"""
# ------------- Inference Phase ----------------------------------------
# Models to be evaluated
models = [*hf_llama3_8b_instruct_model]
# Evaluation models
judge_models = hf_llama3_70b_instruct_model
eng_datasets = [*math_datasets]
chn_datasets = []
datasets = eng_datasets + chn_datasets
for d in eng_datasets:
    d['eval_cfg'] = dict(
        evaluator=dict(
            type=LMEvaluator,
            # If you need to preprocess model predictions before judging,
            # you can specify a pred_postprocessor function here
            pred_postprocessor=dict(type=math_judement_preprocess),
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(round=[
                    dict(
                        role='HUMAN',
                        prompt=eng_obj_prompt
                    ),
                ]),
            ),
        ),
        pred_role="BOT",
    )
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=40000),
    runner=dict(
        type=LocalRunner,
        max_num_workers=256,
        task=dict(type=OpenICLInferTask)),
)
# ------------- Evaluation Configuration --------------------------------
eval = dict(
    partitioner=dict(
        type=SubjectiveSizePartitioner, max_task_size=80000, mode='singlescore', models=models, judge_models=judge_models,
    ),
    runner=dict(type=LocalRunner,
                max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(
    type=AllObjSummarizer
)
# Output folder
work_dir = 'outputs/obj_all/'
```
### Step Two: Launch Evaluation and Output Results

```shell
python run.py eval_math_llm_judge.py
```

This launches two rounds of evaluation: in the first round, the model runs inference to produce predicted answers to the questions; in the second round, the JudgeLLM judges the consistency between the predicted answers and the standard answers and assigns scores.

- The model predictions will be saved in `output/.../timestamp/predictions/xxmodel/xxx.json`
- The JudgeLLM's evaluation responses will be saved in `output/.../timestamp/results/xxmodel/xxx.json`
- The evaluation report will be output to `output/.../timestamp/summary/timestamp/xxx.csv`
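
If you want to sanity-check the verdicts yourself, the sketch below tallies the JudgeLLM's [Yes]/[No] replies into an accuracy. It is illustrative only: the exact layout of the saved results JSON (here assumed to be a dict whose values carry a `prediction` string with the judge's reply) may differ between OpenCompass versions, so adapt the field names to what your run actually produces.

```python
import json
import re


def tally_judge_results(path: str) -> float:
    """Count [Yes]/[No] verdicts in a saved JudgeLLM results file (assumed layout)."""
    with open(path, encoding='utf-8') as f:
        results = json.load(f)

    scores = []
    for item in results.values():
        reply = item.get('prediction', '')  # the judge's raw reply (assumption)
        match = re.search(r'(?i)\[(yes|no)\]', reply)
        if match:
            scores.append(1 if match.group(1).lower() == 'yes' else 0)
    return 100.0 * sum(scores) / max(len(scores), 1)


# Hypothetical path; substitute your own work_dir and timestamp.
# print(tally_judge_results('outputs/obj_all/.../results/llama-3-8b-instruct/math.json'))
```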

## Results

With Llama3-8b-instruct as the evaluated model and Llama3-70b-instruct as the JudgeLLM, the results on the MATH dataset are as follows:

| Model               | JudgeLLM Evaluation | Naive Evaluation |
| ------------------- | ------------------- | ---------------- |
| llama-3-8b-instruct | 27.7                | 27.8             |


@@ -0,0 +1,186 @@
# Using Large Models as JudgeLLM for Objective Evaluation

## Introduction

Although conventional objective evaluation has standard answers to compare against, in practice model predictions may differ because of varying instruction-following ability or imperfect post-processing functions, so the correct answer cannot always be extracted and compared with the reference, and the objective result may therefore be inaccurate. To address this, following the subjective-evaluation workflow, we introduce a JudgeLLM after prediction to assess whether the model's answer is consistent with the standard answer ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).

All models currently supported in the OpenCompass repository can be used directly as the JudgeLLM; support for several dedicated JudgeLLMs is also planned.

## Objective Evaluation Datasets Currently Supporting Direct JudgeLLM Evaluation

1. MATH ([https://github.com/hendrycks/math](https://github.com/hendrycks/math))

## Custom JudgeLLM Objective Dataset Evaluation

OpenCompass currently supports inference for most datasets that use `GenInferencer`. The specific workflow for a custom JudgeLLM objective evaluation is:

1. Build an evaluation configuration, using an API model or an open-source model to infer answers to the questions.
2. Use the selected evaluation model (JudgeLLM) to judge the model's outputs.

### Step One: Build the Evaluation Configuration (Using MATH as an Example)

Below is the config for evaluating the MATH dataset with a JudgeLLM, where the evaluated model is *Llama3-8b-instruct* and the JudgeLLM is *Llama3-70b-instruct*. For more detailed config settings, please refer to `configs/eval_math_llm_judge.py`. The abridged comments below are intended to help users understand the meaning of the configuration file.
```python
# Most of the code in this file is copied from https://github.com/openai/simple-evals/blob/main/math_eval.py
from mmengine.config import read_base
with read_base():
    from .models.hf_llama.hf_llama3_8b_instruct import models as hf_llama3_8b_instruct_model # noqa: F401, F403
    from .models.hf_llama.hf_llama3_70b_instruct import models as hf_llama3_70b_instruct_model # noqa: F401, F403
    from .datasets.math.math_llm_judge import math_datasets # noqa: F401, F403
from opencompass.datasets import math_judement_preprocess
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import AllObjSummarizer
from opencompass.openicl.icl_evaluator import LMEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
# ------------- Prompt Settings ----------------------------------------
# Evaluation template. Modify the template as needed; by default the JudgeLLM answers with [Yes] or [No]. For the MATH dataset, the evaluation template is as follows:
eng_obj_prompt = """
Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications
Examples:
Expression 1: $2x+3$
Expression 2: $3+2x$
[Yes]
Expression 1: 3/2
Expression 2: 1.5
[Yes]
Expression 1: $x^2+2x+1$
Expression 2: $y^2+2y+1$
[No]
Expression 1: $x^2+2x+1$
Expression 2: $(x+1)^2$
[Yes]
Expression 1: 3245/5
Expression 2: 649
[No]
(these are actually equal, don't mark them equivalent if you need to do nontrivial simplifications)
Expression 1: 2/(-3)
Expression 2: -2/3
[Yes]
(trivial simplifications are allowed)
Expression 1: 72 degrees
Expression 2: 72
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2: 64 square feet
[Yes]
(give benefit of the doubt to units)
Expression 1: 64
Expression 2:
[No]
(only mark as equivalent if both expressions are nonempty)
---
YOUR TASK
Respond with only "[Yes]" or "[No]" (without quotes). Do not include a rationale.
Expression 1: {obj_gold}
Expression 2: {prediction}
"""
# ------------- Inference Stage ----------------------------------------
# Models to be evaluated
models = [*hf_llama3_8b_instruct_model]
# Judge model
judge_models = hf_llama3_70b_instruct_model
eng_datasets = [*math_datasets]
chn_datasets = []
datasets = eng_datasets + chn_datasets
for d in eng_datasets:
    d['eval_cfg'] = dict(
        evaluator=dict(
            type=LMEvaluator,
            # If you need to preprocess model predictions before judging,
            # you can specify a pred_postprocessor function here
            pred_postprocessor=dict(type=math_judement_preprocess),
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(round=[
                    dict(
                        role='HUMAN',
                        prompt=eng_obj_prompt
                    ),
                ]),
            ),
        ),
        pred_role="BOT",
    )
infer = dict(
    partitioner=dict(type=SizePartitioner, max_task_size=40000),
    runner=dict(
        type=LocalRunner,
        max_num_workers=256,
        task=dict(type=OpenICLInferTask)),
)
# ------------- Evaluation Configuration --------------------------------
eval = dict(
    partitioner=dict(
        type=SubjectiveSizePartitioner, max_task_size=80000, mode='singlescore', models=models, judge_models=judge_models,
    ),
    runner=dict(type=LocalRunner,
                max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(
    type=AllObjSummarizer
)

# Output folder
work_dir = 'outputs/obj_all/'
```
### Step Two: Launch Evaluation and Output Results

```shell
python run.py eval_math_llm_judge.py
```

This launches two rounds of evaluation: in the first round, the model runs inference to produce predicted answers to the questions; in the second round, the JudgeLLM judges the consistency between the predicted answers and the standard answers and assigns scores.

- The model predictions will be saved in `output/.../timestamp/predictions/xxmodel/xxx.json`
- The JudgeLLM's evaluation responses will be saved in `output/.../timestamp/results/xxmodel/xxx.json`
- The evaluation report will be output to `output/.../timestamp/summary/timestamp/xxx.csv`
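
To collect the final report programmatically, the snippet below scans the summary directory and prints each row. It is illustrative only: the glob pattern assumes the default `work_dir` from the config above, and the CSV column names are not fixed here, so it simply prints whatever the report contains.

```python
import csv
import glob

# Assumes work_dir = 'outputs/obj_all/' as in the config above; adjust if you changed it.
for path in sorted(glob.glob('outputs/obj_all/*/summary/*/*.csv')):
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            print(path, row)
```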

## Results

With Llama3-8b-instruct as the evaluated model and Llama3-70b-instruct as the JudgeLLM, the results on the MATH dataset are as follows:

| Model               | JudgeLLM Evaluation | Naive Evaluation |
| ------------------- | ------------------- | ---------------- |
| llama-3-8b-instruct | 27.7                | 27.8             |


@@ -20,13 +20,13 @@ def post_process_allobj(judgement: str):
    xxx[[correct]]xxx, and extract the judge
    """
-    pattern = r'(?i)\[(incorrect|correct|正确|错误)\]'
+    pattern = r'(?i)\[(incorrect|correct|正确|错误|Yes|No)\]'
    matched_result = re.findall(pattern, judgement)
    if matched_result:
        content = matched_result[0].lower()
-        if content in ['correct', '正确']:
+        if content in ['correct', '正确', 'yes']:
            return {'score': 1}
-        elif content in ['incorrect', '错误']:
+        elif content in ['incorrect', '错误', 'no']:
            return {'score': 0}
    else:
        return None
@@ -81,7 +81,8 @@ class AllObjSummarizer:
            self.base_models = self.cfg['eval']['partitioner']['base_models']
            self.compare_models = self.cfg['eval']['partitioner'][
                'compare_models']
-        self.judge_abbr = model_abbr_from_cfg(self.cfg['judge_models'][0])
+        self.judge_abbr = model_abbr_from_cfg(
+            self.cfg['eval']['partitioner']['judge_models'][0])
        self.judge_map = {'single': post_process_allobj}
        self.judge_function = self.judge_map[self.judge_type]
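
As a quick standalone check (not part of the diff above, just a snippet that replicates the widened regex), the updated pattern now accepts both the old `[[Correct]]`/`[[Incorrect]]` style and the new `[Yes]`/`[No]` verdicts:

```python
import re

# Same pattern as the updated post_process_allobj above.
pattern = r'(?i)\[(incorrect|correct|正确|错误|Yes|No)\]'

samples = [
    'Result: [[Correct]]',   # old-style verdict
    '[Yes]',                 # new-style verdict
    '[No]',
    'no verdict here',
]

for judgement in samples:
    matched = re.findall(pattern, judgement)
    verdict = matched[0].lower() if matched else None
    score = {'correct': 1, '正确': 1, 'yes': 1,
             'incorrect': 0, '错误': 0, 'no': 0}.get(verdict)
    print(f'{judgement!r:28} -> {score}')
```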