# Subjective Evaluation Guidance
## Introduction
Subjective evaluation aims to assess how well a model's outputs align with human preferences. Human preference is the key criterion for this evaluation, but collecting human annotations is costly.
To explore the model's subjective capabilities, we employ a JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).
A popular evaluation method compares model responses pairwise to calculate their win rate ([Chatbot Arena](https://chat.lmsys.org/)); another method scores each single model response on its own.
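To make the win-rate idea concrete, here is a minimal sketch that computes a win rate from a list of judge verdicts; the verdict labels and the tie-counts-as-half convention are illustrative assumptions, not the format of any particular benchmark.

```python
# Minimal sketch: computing a win rate from pairwise judge verdicts.
# The labels 'A', 'B', 'tie' and the tie-as-half convention are illustrative
# assumptions, not the exact output format of any JudgeLLM.
from collections import Counter

def win_rate(verdicts, model='A'):
    """Fraction of comparisons won by `model`, counting ties as half a win."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    return (counts[model] + 0.5 * counts['tie']) / total if total else 0.0

print(win_rate(['A', 'B', 'tie', 'A']))  # 0.625
```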
We support the use of GPT-4 (or other JudgeLLMs) for the subjective evaluation of models based on the above methods.
## Data Preparation
We provide a demo test set as below:
```python
###COREV2
[
    {
        "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
        "capability": "知识-社会常识",
        "others": {
            "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
            "evaluating_guidance": "",
            "reference_answer": "上"
        }
    },
...]

###CreationV0.1
[
    {
        "question": "请你扮演一个邮件管家, 我让你给谁发送什么主题的邮件, 你就帮我扩充好邮件正文, 并打印在聊天框里。你需要根据我提供的邮件收件人以及邮件主题, 来斟酌用词, 并使用合适的敬语。现在请给导师发送邮件, 询问他是否可以下周三下午15:00进行科研同步会, 大约200字。",
        "capability": "邮件通知",
        "others": ""
    },
...]
```
The JSON must include the following fields:
- 'question': The question description.
- 'capability': The capability dimension of the question.
- 'others': Other needed information.

If you want to customize the prompt for each individual question, you can fill additional information into 'others' and use it when constructing the prompt, as shown in the sketch after this list.
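For instance, a hypothetical entry with extra material in 'others' could be written out with a small Python script. The question text and the `my_subjective_set.json` filename below are placeholders; the 'evaluating_guidance' and 'reference_answer' keys mirror the COREV2 demo above.

```python
# Hypothetical example of building a custom test set file.
# The extra keys inside 'others' are optional and can be referenced later
# when constructing the judge prompt.
import json

custom_items = [
    {
        'question': 'If I throw a ball straight up, which direction does it travel first?',
        'capability': 'knowledge-common-sense',
        'others': {
            'evaluating_guidance': 'Judge only the direction stated in the answer.',
            'reference_answer': 'Up',
        },
    },
]

with open('my_subjective_set.json', 'w', encoding='utf-8') as f:
    json.dump(custom_items, f, ensure_ascii=False, indent=2)
```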
## Evaluation Configuration
The specific process includes:
1. Model response inference
2. JudgeLLM evaluation comparisons
3. Generating evaluation reports
### Two-Model Comparison Configuration
For `config/subjective_compare.py`, we provide annotations to help users understand the meaning of each configuration field.
```python
from mmengine.config import read_base

with read_base():
    from .datasets.subjective_cmp.subjective_corev2 import subjective_datasets

from opencompass.partitioners import SubjectiveNaivePartitioner
from opencompass.summarizers import Corev2Summarizer

datasets = [*subjective_datasets]  # set the datasets
models = [...]  # set the models to be evaluated
judge_model = [...]  # set the JudgeLLM

eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='m2n',  # choose the eval mode; in 'm2n' mode you need to set base_models and compare_models, and pairs are generated between base_models and compare_models
        base_models=[...],
        compare_models=[...],
    ))

work_dir = 'Your work dir'  # set your work dir; if you use '--reuse', all existing results in this work dir are reused automatically

summarizer = dict(
    type=Corev2Summarizer,  # the summarizer for your dataset
    match_method='smart',  # the answer extraction method
)
```
In addition, you can change the response order of the two models; please refer to `config/subjective_compare.py`.
When `infer_order` is set to `random`, the two responses are presented to the JudgeLLM in random order;
when `infer_order` is set to `double`, each pair of responses is judged twice, once in each order.
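Below is a minimal sketch of where this option typically lives, under the assumption that `infer_order` is an argument of the judge evaluator in the dataset's eval config; the exact placement may differ between versions, so follow `config/subjective_compare.py` and the dataset config it imports.

```python
# Sketch only: the exact placement of `infer_order` may differ between
# versions; refer to config/subjective_compare.py and the imported
# dataset config for the authoritative definition.
from opencompass.openicl.icl_evaluator import LMEvaluator

subjective_eval_cfg = dict(
    evaluator=dict(
        type=LMEvaluator,
        infer_order='double',  # 'random': shuffle the order of the two responses; 'double': judge each pair twice, once in each order
        # prompt_template=...  # the judge prompt template defined in the dataset config
    ),
)
```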
### Single Model Scoring Configuration
`config/subjective_score.py` is largely the same as `config/subjective_compare.py`; you only need to change the eval mode to `singlescore`.
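Concretely, the difference is confined to the partitioner's mode. The snippet below is a sketch based on that description, not the full file, so check `config/subjective_score.py` for the rest of the configuration.

```python
# Sketch: only the part that differs from the two-model comparison config.
from opencompass.partitioners import SubjectiveNaivePartitioner

eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='singlescore',  # score each model's responses individually instead of comparing pairs
    ))
```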
## Launching the Evaluation
```shell
python run.py config/subjective_compare.py -r  # or config/subjective_score.py for single-model scoring
```
The `-r` parameter allows the reuse of model inference and GPT-4 evaluation results.
## Evaluation Report
The responses of the JudgeLLM will be output to `output/.../results/timestamp/xxmodel/xxdataset/.json` .
The evaluation report will be output to `output/.../summary/timestamp/report.csv` .