# Subjective Evaluation Guidance

## Introduction

Subjective evaluation aims to assess a model's performance on tasks that align with human preferences. The key criterion for this evaluation is human preference, but it comes with a high annotation cost.

To explore a model's subjective capabilities, we employ a JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).

Two popular evaluation methods are:

- Compare Mode: compare model responses pairwise to calculate their win rate
- Score Mode: score each single model response directly ([Chatbot Arena](https://chat.lmsys.org/))

We support the use of GPT-4 (or any other JudgeLLM) for the subjective evaluation of models based on the methods above.

## Subjective Evaluation with Custom Dataset

The specific process includes:

1. Data preparation
2. Model response generation
3. Evaluate the responses with a JudgeLLM
4. Collect the JudgeLLM's responses and calculate the metric

### Step-1: Data Preparation

We provide mini test-sets for **Compare Mode** and **Score Mode** as below:

```python
###COREV2
[
    {
        "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
        "capability": "知识-社会常识",
        "others": {
            "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
            "evaluating_guidance": "",
            "reference_answer": "上"
        }
    },...]

###CreationV0.1
[
    {
        "question": "请你扮演一个邮件管家,我让你给谁发送什么主题的邮件,你就帮我扩充好邮件正文,并打印在聊天框里。你需要根据我提供的邮件收件人以及邮件主题,来斟酌用词,并使用合适的敬语。现在请给导师发送邮件,询问他是否可以下周三下午15:00进行科研同步会,大约200字。",
        "capability": "邮件通知",
        "others": ""
    },
```

The JSON must include the following fields:

- 'question': Question description
- 'capability': The capability dimension of the question.
- 'others': Other needed information.

If you want to modify the prompt for each individual question, you can put the extra information into 'others' and use it when constructing the prompt.

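As an illustration only (not an OpenCompass utility), the snippet below writes a tiny custom dataset with the required fields; the file name `my_subjective_set.json` and the example entry are hypothetical:

```python
import json

# Hypothetical entries following the required fields described above.
entries = [
    {
        "question": "If I throw a ball straight up into the air, in which direction does it initially travel?",
        "capability": "knowledge-common-sense",
        "others": {
            "evaluating_guidance": "",
            "reference_answer": "up",
        },
    },
]

# Sanity-check that every entry carries the mandatory fields.
for entry in entries:
    assert {"question", "capability", "others"} <= entry.keys()

# Hypothetical output file; point your dataset config at wherever you store it.
with open("my_subjective_set.json", "w", encoding="utf-8") as f:
    json.dump(entries, f, ensure_ascii=False, indent=2)
```
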
### Step-2: Evaluation Configuration (Compare Mode)

For `config/eval_subjective_compare.py`, we provide some annotations to help users understand the configuration file.

```python
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM, HuggingFace, OpenAI

from opencompass.partitioners import NaivePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import Corev2Summarizer

with read_base():
    # Pre-defined models
    from .models.qwen.hf_qwen_7b_chat import models as hf_qwen_7b_chat
    from .models.chatglm.hf_chatglm3_6b import models as hf_chatglm3_6b
    from .models.qwen.hf_qwen_14b_chat import models as hf_qwen_14b_chat
    from .models.openai.gpt_4 import models as gpt4_model
    from .datasets.subjective_cmp.subjective_corev2 import subjective_datasets

# Evaluation datasets
datasets = [*subjective_datasets]

# Models to be evaluated
models = [*hf_qwen_7b_chat, *hf_chatglm3_6b]

# Inference configuration
infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(
        type=SlurmSequentialRunner,
        partition='llmeval',
        quotatype='auto',
        max_num_workers=256,
        task=dict(type=OpenICLInferTask)),
)

# Evaluation configuration
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='m2n',  # m-model vs. n-model
        # Under the m2n setting, base_models and compare_models must be specified;
        # the program will generate all pairs between base_models and compare_models.
        base_models=[*hf_qwen_14b_chat],  # Baseline model
        compare_models=[*hf_qwen_7b_chat, *hf_chatglm3_6b],  # Models to be evaluated
    ),
    runner=dict(
        type=SlurmSequentialRunner,
        partition='llmeval',
        quotatype='auto',
        max_num_workers=256,
        task=dict(
            type=SubjectiveEvalTask,
            judge_cfg=gpt4_model  # Judge model
        )),
)
work_dir = './outputs/subjective/'

summarizer = dict(
    type=Corev2Summarizer,  # Custom summarizer
    match_method='smart',  # Answer extraction
)
```

In addition, you can change the response order of the two models; please refer to `config/eval_subjective_compare.py`.
When `infer_order` is set to `random`, the responses are presented in random order;
when `infer_order` is set to `double`, the responses of the two models are each evaluated twice, once in each order.

### Step-2: Evaluation Configuration (Score Mode)

The configuration in `config/eval_subjective_score.py` is largely the same as `config/eval_subjective_compare.py`; you only need to change the eval mode to `singlescore`.

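For orientation, here is a hedged sketch of how the eval block could look in Score Mode, under the assumption that only the partitioner's `mode` changes relative to the Compare Mode config above; treat `config/eval_subjective_score.py` as the authoritative reference:

```python
# Score Mode sketch: the judge scores each model response on its own,
# so no base_models/compare_models pairing is required.
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='singlescore',  # score single responses instead of comparing pairs
    ),
    runner=dict(
        type=SlurmSequentialRunner,
        partition='llmeval',
        quotatype='auto',
        max_num_workers=256,
        task=dict(
            type=SubjectiveEvalTask,
            judge_cfg=gpt4_model  # Judge model, same as in Compare Mode
        )),
)
```
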
### Step-3: Launch the Evaluation

```shell
python run.py config/eval_subjective_score.py -r
```

The `-r` parameter allows the reuse of existing model inference and GPT-4 evaluation results.

The JudgeLLM's responses will be output to `output/.../results/timestamp/xxmodel/xxdataset/.json`.
The evaluation report will be output to `output/.../summary/timestamp/report.csv`.

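If you want to inspect these outputs programmatically, a minimal sketch like the following can help; the concrete file names below are placeholders (they depend on your timestamp, model, and dataset) rather than fixed OpenCompass paths:

```python
import csv
import json
from pathlib import Path

# Placeholder paths -- substitute your own timestamp, model and dataset directories.
judge_json = Path("outputs/subjective/results/<timestamp>/<model>/<dataset>.json")
report_csv = Path("outputs/subjective/summary/<timestamp>/report.csv")

# Raw JudgeLLM responses for each evaluated example.
with judge_json.open(encoding="utf-8") as f:
    judge_outputs = json.load(f)
print(f"{len(judge_outputs)} judged examples")

# Aggregated report produced by the summarizer.
with report_csv.open(encoding="utf-8") as f:
    for row in csv.reader(f):
        print(row)
```
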
OpenCompass supports many JudgeLLMs; in fact, you can use any model available in the OpenCompass configs as a JudgeLLM.
The popular open-source JudgeLLMs are listed here:

1. Auto-J, refer to `configs/models/judge_llm/auto_j`

Consider citing the following paper if you find it helpful:

```bibtex
@article{li2023generative,
  title={Generative judge for evaluating alignment},
  author={Li, Junlong and Sun, Shichao and Yuan, Weizhe and Fan, Run-Ze and Zhao, Hai and Liu, Pengfei},
  journal={arXiv preprint arXiv:2310.05470},
  year={2023}
}

@misc{2023opencompass,
  title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
  author={OpenCompass Contributors},
  howpublished={\url{https://github.com/open-compass/opencompass}},
  year={2023}
}
```

2. JudgeLM, refer to `configs/models/judge_llm/judgelm`

```bibtex
@article{zhu2023judgelm,
  title={JudgeLM: Fine-tuned Large Language Models are Scalable Judges},
  author={Zhu, Lianghui and Wang, Xinggang and Wang, Xinlong},
  journal={arXiv preprint arXiv:2310.17631},
  year={2023}
}

@misc{2023opencompass,
  title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
  author={OpenCompass Contributors},
  howpublished={\url{https://github.com/open-compass/opencompass}},
  year={2023}
}
```

3. PandaLM, refer to `configs/models/judge_llm/pandalm`

Consider citing the following paper if you find it helpful:

```bibtex
@article{wang2023pandalm,
  title={PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization},
  author={Wang, Yidong and Yu, Zhuohao and Zeng, Zhengran and Yang, Linyi and Wang, Cunxiang and Chen, Hao and Jiang, Chaoya and Xie, Rui and Wang, Jindong and Xie, Xing and others},
  journal={arXiv preprint arXiv:2306.05087},
  year={2023}
}

@misc{2023opencompass,
  title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
  author={OpenCompass Contributors},
  howpublished={\url{https://github.com/open-compass/opencompass}},
  year={2023}
}
```
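
Any of these judges can be plugged in the same way GPT-4 is used above: import its predefined model config and pass it as `judge_cfg`. The import path below is illustrative (check the actual file names under `configs/models/judge_llm/`), and the rest mirrors the Compare Mode config from Step-2:

```python
with read_base():
    # Illustrative path -- check configs/models/judge_llm/auto_j for the real file name.
    from .models.judge_llm.auto_j.hf_autoj_eng_13b import models as judge_model

# Same eval block as in the Compare Mode config, with the open-source judge swapped in.
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='m2n',
        base_models=[*hf_qwen_14b_chat],
        compare_models=[*hf_qwen_7b_chat, *hf_chatglm3_6b],
    ),
    runner=dict(
        type=SlurmSequentialRunner,
        partition='llmeval',
        quotatype='auto',
        max_num_workers=256,
        task=dict(
            type=SubjectiveEvalTask,
            judge_cfg=judge_model  # Auto-J as the judge instead of GPT-4
        )),
)
```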

## Practice: AlignBench Evaluation

### Dataset

```bash
mkdir -p ./data/subjective/

cd ./data/subjective
git clone https://github.com/THUDM/AlignBench.git

# data format conversion
python ../../../tools/convert_alignmentbench.py --mode json --jsonl data/data_release.jsonl
```

### Configuration

Please edit the config `configs/eval_subjective_alignbench.py` according to your needs.

### Evaluation

```bash
HF_EVALUATE_OFFLINE=1 HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 python run.py configs/eval_subjective_alignbench.py
```

### Submit to Official Leaderboard (Optional)

If you need to submit your predictions to the official leaderboard, you can use `tools/convert_alignmentbench.py` for format conversion.

- Make sure you have the following results

```bash
outputs/
└── 20231214_173632
    ├── configs
    ├── logs
    ├── predictions  # models' responses
    ├── results
    └── summary
```

- Convert the data

```bash
python tools/convert_alignmentbench.py --mode csv --exp-folder outputs/20231214_173632
```

- Get the `.csv` file in `submission/` for submission

```bash
outputs/
└── 20231214_173632
    ├── configs
    ├── logs
    ├── predictions
    ├── results
    ├── submission  # files to submit
    └── summary
```