# Subjective Evaluation Guidance
## Introduction
Subjective evaluation aims to assess a model's performance on tasks that align with human preferences. The key criterion for this evaluation is human preference, but annotation comes at a high cost.

To explore a model's subjective capabilities, we employ a JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).

A popular evaluation method compares model responses pairwise to calculate their win rate ([Chatbot Arena](https://chat.lmsys.org/)); another method scores each model's response individually.

We support the use of GPT-4 (or any other JudgeLLM) for subjective evaluation of models based on the above methods.
## Data Preparation

We provide a demo test set as below:

```python
###COREV2
[
    {
        "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
        "capability": "知识-社会常识",
        "others": {
            "question": "如果我在空中垂直抛球,球最初向哪个方向行进?",
            "evaluating_guidance": "",
            "reference_answer": "上"
        }
    },...]

###CreationV0.1
[
    {
        "question": "请你扮演一个邮件管家,我让你给谁发送什么主题的邮件,你就帮我扩充好邮件正文,并打印在聊天框里。你需要根据我提供的邮件收件人以及邮件主题,来斟酌用词,并使用合适的敬语。现在请给导师发送邮件,询问他是否可以下周三下午15:00进行科研同步会,大约200字。",
        "capability": "邮件通知",
        "others": ""
    },
```

The JSON must include the following fields:

- 'question': The question description.
- 'capability': The capability dimension of the question.
- 'others': Other needed information.

If you want to customize the prompt for an individual question, you can fill the extra information into 'others' and use it when constructing the prompt.
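
As a quick illustration (not part of OpenCompass itself), the snippet below assembles a small test set with the three required fields and writes it to a JSON file; the entry contents and the file name `demo_subjective.json` are made up for this example:

```python
import json

# Hypothetical example entry; 'others' may be a dict (as in COREV2)
# or an empty string (as in CreationV0.1).
entries = [
    {
        "question": "If I throw a ball straight up, which direction does it travel first?",
        "capability": "knowledge-common-sense",
        "others": {
            "question": "If I throw a ball straight up, which direction does it travel first?",
            "evaluating_guidance": "",
            "reference_answer": "up",
        },
    },
]

# Every entry must carry the three required fields.
required = {"question", "capability", "others"}
for entry in entries:
    missing = required - entry.keys()
    assert not missing, f"missing fields: {missing}"

# Write the demo set to disk (example file name).
with open("demo_subjective.json", "w", encoding="utf-8") as f:
    json.dump(entries, f, ensure_ascii=False, indent=4)
```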
## Evaluation Configuration

The specific process includes:

1. Model response inference
2. JudgeLLM evaluation comparisons
3. Generating evaluation reports

### Two-Model Comparison Configuration

For `config/subjective_compare.py`, we provide annotations below to help users understand the configuration file.

```python
from mmengine.config import read_base
with read_base():
    from .datasets.subjective_cmp.subjective_corev2 import subjective_datasets

from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.summarizers import Corev2Summarizer

datasets = [*subjective_datasets]  # set the datasets
models = [...]  # set the models to be evaluated
judge_model = [...]  # set the JudgeLLM

eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='m2n',  # choose the eval mode; in 'm2n' mode you need to set base_models and compare_models, and pairs are generated between base_models and compare_models
        base_models=[...],
        compare_models=[...]
    ))

work_dir = 'Your work dir'  # set your work dir; when '--reuse' is passed, all existing results in this work dir are reused automatically

summarizer = dict(
    type=Corev2Summarizer,  # the Summarizer for your dataset
    match_method='smart',  # the answer extraction method
)
```

In addition, you can change the response order of the two models; please refer to `config/subjective_compare.py`:
when `infer_order` is set to `random`, the two responses are presented to the judge in random order;
when `infer_order` is set to `double`, each pair of responses is judged twice, once in each order.
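
For reference, here is a minimal sketch of where `infer_order` might be set. It assumes the option is an argument of the judge-side `LMEvaluator` inside the dataset's `eval_cfg`; the exact placement may differ across OpenCompass versions, so check `config/subjective_compare.py` for the authoritative setting.

```python
from opencompass.openicl.icl_evaluator import LMEvaluator

# Assumed placement (verify against config/subjective_compare.py for your version):
# infer_order is passed to the LMEvaluator that acts as the judge.
subjective_eval_cfg = dict(
    evaluator=dict(
        type=LMEvaluator,
        infer_order='double',  # 'random' shuffles the pair order; 'double' judges each pair in both orders
        prompt_template=...,   # the judge prompt template defined in the original config
    ),
)
```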
### Single Model Scoring Configuration

`config/subjective_score.py` is largely the same as `config/subjective_compare.py`; you only need to change the eval mode to `singlescore`.
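
For illustration, a minimal sketch of the `eval` block in `singlescore` mode is shown below (exact arguments may vary between versions; no base/compare pairing is needed since each model is scored on its own):

```python
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner

# In 'singlescore' mode each model's responses are scored independently,
# so base_models / compare_models are not required.
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='singlescore',
    ))
```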
## Launching the Evaluation
```shell
python run.py config/subjective.py -r
```
The `-r` parameter allows the reuse of model inference and GPT-4 evaluation results.
## Evaluation Report
The responses from the JudgeLLM will be output to `output/.../results/timestamp/xxmodel/xxdataset/.json`.
The evaluation report will be output to `output/.../summary/timestamp/report.csv`.
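
If you want to inspect these outputs programmatically, a minimal sketch is shown below; the bracketed path segments are placeholders to replace with your actual work dir, timestamp, model, and dataset names:

```python
import json

import pandas as pd

# Placeholder paths: substitute your actual work dir, timestamp, model and dataset.
report = pd.read_csv("output/<work_dir>/summary/<timestamp>/report.csv")
print(report.head())

with open("output/<work_dir>/results/<timestamp>/<model>/<dataset>.json", encoding="utf-8") as f:
    judge_responses = json.load(f)
print(len(judge_responses))
```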
|