# Subjective Evaluation Guidance
## Introduction
Subjective evaluation aims to assess the model's performance in tasks that align with human preferences. The key criterion for this evaluation is human preference, but it comes with a high cost of annotation.
To explore the model's subjective capabilities, we employ a state-of-the-art LLM (GPT-4) as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).
A popular evaluation method involves comparing model responses pairwise to calculate their win rate ([Chatbot Arena](https://chat.lmsys.org/)).
We support the use of GPT-4 for the subjective evaluation of models based on this method.
## Data Preparation
We provide a demo test set [subjective_demo.xlsx](https://opencompass.openxlab.space/utils/subjective_demo.xlsx) based on [z-bench](https://github.com/zhenbench/z-bench).
Store the set of subjective questions in `.xlsx` format in the `data/subjective/` directory.
The table includes the following fields:
- 'question': Question description
- 'index': Question number
- 'reference_answer': Reference answer
- 'evaluating_guidance': Evaluation guidance
- 'capability': Capability dimension of the question
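If you want to build your own question set instead of using the demo file, a minimal sketch with `pandas` (requires `openpyxl`; the file name and the example row below are placeholders, not part of OpenCompass) might look like this:
```python
# Illustrative sketch: write a tiny question set with the expected column layout.
import pandas as pd

questions = pd.DataFrame([
    {
        'question': 'Introduce the city of Shanghai in one paragraph.',
        'index': 1,
        'reference_answer': "Shanghai is a major financial centre on China's east coast ...",
        'evaluating_guidance': 'Check factual accuracy, fluency and completeness.',
        'capability': 'common',
    },
])

# OpenCompass reads subjective sets from data/subjective/, so save the file there.
questions.to_excel('data/subjective/my_subjective_set.xlsx', index=False)
```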
## Evaluation Configuration
The specific process includes:
1. Model response inference
2. GPT-4 pairwise evaluation
3. Generating the evaluation report
For `config/subjective.py`, we provide some annotations to help users understand the configuration file's meaning.
```python
# Import datasets and the subjective evaluation summarizer
from mmengine.config import read_base
with read_base():
    from .datasets.subjective_cmp.subjective_cmp import subjective_datasets
    from .summarizers.subjective import summarizer

datasets = [*subjective_datasets]

from opencompass.models import HuggingFaceCausalLM, HuggingFace, OpenAI

# Import the partitioner and task required for subjective evaluation
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask

# Define model configurations for inference and evaluation,
# including the inference models chatglm2-6b, qwen-7b-chat, internlm-chat-7b, and the judge model gpt4
models = [...]

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[
        dict(role='SYSTEM', api_role='SYSTEM'),
    ],
)

# Define the configuration for subjective evaluation
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='all',  # build pairwise comparisons between all models
    ),
    runner=dict(
        type=LocalRunner,
        max_num_workers=2,  # supports parallel comparisons
        task=dict(
            type=SubjectiveEvalTask,  # used to read inputs for a pair of models
            judge_cfg=dict(
                abbr='GPT4',
                type=OpenAI,
                path='gpt-4-0613',
                key='ENV',  # with key='ENV', the key is read from the OPENAI_API_KEY environment variable
                meta_template=api_meta_template,
                query_per_second=1,
                max_out_len=2048,
                max_seq_len=2048,
                batch_size=2),
        )),
)
```
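The `models = [...]` placeholder holds the ordinary OpenCompass model configurations of the systems being compared. As a rough sketch (the path, generation settings, and `trust_remote_code` flags below are illustrative, not prescriptive), one entry for a HuggingFace chat model could look like:
```python
# Illustrative only: one possible entry for the models list; adjust paths and
# generation settings to the checkpoints you actually evaluate.
models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='chatglm2-6b-hf',
        path='THUDM/chatglm2-6b',
        tokenizer_path='THUDM/chatglm2-6b',
        tokenizer_kwargs=dict(trust_remote_code=True),
        model_kwargs=dict(trust_remote_code=True, device_map='auto'),
        max_out_len=512,
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    ),
    # ... add the other models to be compared, e.g. qwen-7b-chat and internlm-chat-7b
]
```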
## Launching the Evaluation
```shell
python run.py config/subjective.py -r
```
The `-r` parameter allows the reuse of model inference and GPT-4 evaluation results.
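If you want to reuse the outputs of a specific earlier run rather than the latest one, you can pass that run's timestamped output directory name to `-r` (the timestamp below is only an example):
```shell
python run.py config/subjective.py -r 20231027_163000
```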
## Evaluation Report
The evaluation report will be output to `output/.../summary/timestamp/report.md`, which includes win rate statistics, battle scores, and ELO ratings. The specific format is as follows:
```markdown
# Subjective Analysis
A total of 30 comparisons, of which 30 comparisons are meaningful (A / B answers inconsistent)
A total of 30 answer comparisons, successfully extracted 30 answers from GPT-4 replies, with an extraction success rate of 100.00%
### Basic statistics (4 stats: win / tie / lose / not bad)
| Dimension \ Stat [W / T / L / NB] | chatglm2-6b-hf | qwen-7b-chat-hf | internlm-chat-7b-hf |
| --------------------------------- | ----------------------------- | ---------------------------- | ----------------------------- |
| LANG: Overall | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: CN | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: EN | N/A | N/A | N/A |
| CAPA: common | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
![Capabilities Dimension Classification Result](by_capa.png)
![Language Classification Result](by_lang.png)
### Model scores (base score is 0, win +3, both +1, neither -1, lose -3)
| Dimension \ Score | chatglm2-6b-hf | qwen-7b-chat-hf | internlm-chat-7b-hf |
| ----------------- | -------------- | --------------- | ------------------- |
| LANG: Overall | -8 | 0 | -8 |
| LANG: CN | -8 | 0 | -8 |
| LANG: EN | N/A | N/A | N/A |
| CAPA: common | -8 | 0 | -8 |
### Bootstrap ELO, Median of n=1000 times
| | chatglm2-6b-hf | internlm-chat-7b-hf | qwen-7b-chat-hf |
| ---------------- | -------------- | ------------------- | --------------- |
| elo_score [Mean] | 999.504 | 999.912 | 1000.26 |
| elo_score [Std] | 0.621362 | 0.400226 | 0.694434 |
```
When comparing the responses of models A and B, the judge is given four choices:
1. A is better than B.
2. A and B are equally good.
3. A is worse than B.
4. Neither A nor B is good.
So `win` / `tie` / `lose` / `not bad` represent the proportions of comparisons in which the model wins, ties, loses, or either wins or is rated equally good, respectively.
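The battle score in the report follows the rule quoted in its header: win +3, both good +1, neither good -1, lose -3, starting from a base score of 0. A small sketch of that bookkeeping (the outcome labels are placeholders, not the exact strings used internally):
```python
# Illustrative scoring sketch: win +3, both good +1, neither good -1, lose -3.
SCORE = {'win': 3, 'both_good': 1, 'neither_good': -1, 'lose': -3}

def battle_score(outcomes):
    """Sum the per-comparison scores for one model, starting from a base of 0."""
    return sum(SCORE[o] for o in outcomes)

# e.g. 3 wins, 4 "both good", 1 "neither good", 2 losses -> 3*3 + 4*1 - 1 - 2*3 = 6
print(battle_score(['win'] * 3 + ['both_good'] * 4 + ['neither_good'] + ['lose'] * 2))
```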
`Bootstrap ELO` is obtained by randomly permuting the order of the match results 1000 times, computing an ELO score for each permutation, and reporting the median.
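A minimal sketch of this bootstrap procedure is shown below; the K-factor, the 1000-point base rating, and the match representation are assumptions for illustration, not necessarily the constants OpenCompass uses internally.
```python
# Bootstrap-ELO sketch; K=32 and the 1000-point base rating are assumptions.
import random
from collections import defaultdict

def elo_from_matches(matches, k=32, base=1000.0):
    """matches: list of (model_a, model_b, score_a) with score_a in {1, 0.5, 0}."""
    rating = defaultdict(lambda: base)
    for a, b, score_a in matches:
        expected_a = 1 / (1 + 10 ** ((rating[b] - rating[a]) / 400))
        rating[a] += k * (score_a - expected_a)
        rating[b] += k * ((1 - score_a) - (1 - expected_a))
    return dict(rating)

def bootstrap_elo(matches, n=1000):
    """Shuffle the match order n times and report the median rating per model."""
    samples = defaultdict(list)
    for _ in range(n):
        shuffled = random.sample(matches, len(matches))
        for model, score in elo_from_matches(shuffled).items():
            samples[model].append(score)
    return {m: sorted(s)[len(s) // 2] for m, s in samples.items()}
```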