# Subjective Evaluation Guidance

## Introduction

Subjective evaluation aims to assess a model's performance on tasks that align with human preferences. The key criterion for this evaluation is human preference, but human annotation is costly.

To explore a model's subjective capabilities, we employ a state-of-the-art LLM (GPT-4) as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).

A popular evaluation method compares model responses pairwise and computes their win rates ([Chatbot Arena](https://chat.lmsys.org/)).

Based on this method, we support using GPT-4 for the subjective evaluation of models.

## Data Preparation

We provide a demo test set, [subjective_demo.xlsx](https://opencompass.openxlab.space/utils/subjective_demo.xlsx), based on [z-bench](https://github.com/zhenbench/z-bench).

Store the set of subjective questions as an .xlsx file in the `data/subjective/` directory.

The table includes the following fields:

- 'question': Question description
- 'index': Question number
- 'reference_answer': Reference answer
- 'evaluating_guidance': Evaluation guidance
- 'capability': The capability dimension of the question
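
If you build your own question set, the following is a minimal sketch of producing such a file (the file name, example row, and the use of pandas with an xlsx writer such as openpyxl are illustrative assumptions, not OpenCompass requirements):

```python
# Illustrative sketch: build a question table with the fields listed above
# and save it under data/subjective/ in .xlsx format (requires pandas + openpyxl).
import pandas as pd

questions = pd.DataFrame([
    {
        'index': 0,
        'question': 'Write a short poem about autumn.',
        'reference_answer': '',  # assumed it may stay empty for open-ended questions
        'evaluating_guidance': 'Prefer vivid imagery and fluent language.',
        'capability': 'common',
    },
])
questions.to_excel('data/subjective/my_subjective_set.xlsx', index=False)
```
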
## Evaluation Configuration

The overall process consists of:

1. Model response inference
2. GPT-4 pairwise comparison
3. Generating the evaluation report

For `config/subjective.py`, we provide annotations to help users understand the meaning of each part of the configuration file.

```python
# Import datasets and subjective evaluation summarizer
from mmengine.config import read_base
with read_base():
    from .datasets.subjective_cmp.subjective_cmp import subjective_datasets
    from .summarizers.subjective import summarizer

datasets = [*subjective_datasets]

from opencompass.models import HuggingFaceCausalLM, HuggingFace, OpenAI

# Import partitioner and task required for subjective evaluation
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask

# Define model configurations for inference and evaluation,
# including the inference models chatglm2-6b, qwen-7b-chat, internlm-chat-7b, and the evaluation model gpt4
models = [...]

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[
        dict(role='SYSTEM', api_role='SYSTEM'),
    ],
)

# Define the configuration for subjective evaluation
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='all',  # alternately constructs two for comparisons
    ),
    runner=dict(
        type=LocalRunner,
        max_num_workers=2,  # supports parallel comparisons
        task=dict(
            type=SubjectiveEvalTask,  # used to read inputs for a pair of models
            judge_cfg=dict(
                abbr='GPT4',
                type=OpenAI,
                path='gpt-4-0613',
                key='ENV',
                meta_template=api_meta_template,
                query_per_second=1,
                max_out_len=2048,
                max_seq_len=2048,
                batch_size=2),
        )),
)
```
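
The `models` list above is left elided. Purely as an illustration of what one inference-model entry could look like (the model name, paths, and argument values below are hypothetical, and the exact argument names may vary across OpenCompass versions; see the configs shipped under `configs/models/` for authoritative examples):

```python
# Hypothetical example of a single inference-model entry for `models`.
example_model = dict(
    type=HuggingFaceCausalLM,               # imported in the config above
    abbr='my-chat-7b-hf',                   # hypothetical display name
    path='my-org/my-chat-7b',               # hypothetical HuggingFace model id
    tokenizer_path='my-org/my-chat-7b',
    max_out_len=1024,
    max_seq_len=2048,
    batch_size=8,
    run_cfg=dict(num_gpus=1, num_procs=1),
)
```
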

## Launching the Evaluation

```shell
python run.py config/subjective.py -r
```

The `-r` flag allows reusing model inference and GPT-4 evaluation results from a previous run, so completed steps are not re-executed.

## Evaluation Report

The evaluation report will be output to `output/.../summary/timestamp/report.md`, which includes win rate statistics, battle scores, and ELO ratings. The specific format is as follows:

```markdown
# Subjective Analysis

A total of 30 comparisons, of which 30 comparisons are meaningful (A / B answers inconsistent)
A total of 30 answer comparisons, successfully extracted 30 answers from GPT-4 replies, with an extraction success rate of 100.00%

### Basic statistics (4 stats: win / tie / lose / not bad)

| Dimension \ Stat [W / T / L / NB] | chatglm2-6b-hf                | qwen-7b-chat-hf              | internlm-chat-7b-hf           |
| --------------------------------- | ----------------------------- | ---------------------------- | ----------------------------- |
| LANG: Overall                     | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: CN                          | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: EN                          | N/A                           | N/A                          | N/A                           |
| CAPA: common                      | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |

### Model scores (base score is 0, win +3, both +1, neither -1, lose -3)

| Dimension \ Score | chatglm2-6b-hf | qwen-7b-chat-hf | internlm-chat-7b-hf |
| ----------------- | -------------- | --------------- | ------------------- |
| LANG: Overall     | -8             | 0               | -8                  |
| LANG: CN          | -8             | 0               | -8                  |
| LANG: EN          | N/A            | N/A             | N/A                 |
| CAPA: common      | -8             | 0               | -8                  |

### Bootstrap ELO, Median of n=1000 times

|                  | chatglm2-6b-hf | internlm-chat-7b-hf | qwen-7b-chat-hf |
| ---------------- | -------------- | ------------------- | --------------- |
| elo_score [Mean] | 999.504        | 999.912             | 1000.26         |
| elo_score [Std]  | 0.621362       | 0.400226            | 0.694434        |
```

When comparing models A and B, the judge has four choices:

1. A is better than B.
2. A and B are equally good.
3. A is worse than B.
4. Neither A nor B is good.

So `win` / `tie` / `lose` / `not bad` denote the proportions of comparisons in which the model wins / ties / loses / either wins or is judged equally good, respectively.
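
As a minimal sketch of the battle-score tally shown in the "Model scores" table (the judgment labels `'win'`, `'both'`, `'neither'`, `'lose'` below are illustrative, not necessarily the labels OpenCompass uses internally):

```python
# Starting from a base score of 0: win +3, "both good" +1, "neither good" -1, lose -3.
SCORE_DELTA = {'win': 3, 'both': 1, 'neither': -1, 'lose': -3}

def battle_score(judgments):
    """Sum the score deltas for one model across all of its comparisons."""
    return sum(SCORE_DELTA[j] for j in judgments)

# e.g. battle_score(['lose', 'both', 'neither', 'lose']) == -6
```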

`Bootstrap ELO` is the median ELO score obtained by replaying the match results in 1000 random permutations.
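
A minimal sketch of this bootstrap procedure (not the exact OpenCompass implementation; the K-factor, base rating, and match format are assumptions):

```python
# Replay the same matches in many random orders and take each model's median rating.
import random
from collections import defaultdict

def elo_ratings(matches, k=32, base=1000.0):
    """matches: list of (model_a, model_b, winner), with winner in {'A', 'B', 'tie'}."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in matches:
        expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
        score_a = 1.0 if winner == 'A' else 0.0 if winner == 'B' else 0.5
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1 - score_a) - (1 - expected_a))
    return dict(ratings)

def bootstrap_elo(matches, n=1000, seed=0):
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n):
        shuffled = matches[:]
        rng.shuffle(shuffled)  # permute the match order
        for model, rating in elo_ratings(shuffled).items():
            samples[model].append(rating)
    # Median rating per model over all permutations.
    return {m: sorted(r)[len(r) // 2] for m, r in samples.items()}
```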