# Subjective Evaluation Guidance

## Introduction

Subjective evaluation aims to assess a model's performance on tasks that align with human preferences. The key criterion for this evaluation is human preference, but human annotation is costly.

To explore a model's subjective capabilities, we employ a state-of-the-art LLM (GPT-4) as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).

A popular evaluation method compares model responses pairwise and computes their win rates ([Chatbot Arena](https://chat.lmsys.org/)).

Based on this method, we support using GPT-4 for the subjective evaluation of models.

## Data Preparation

We provide a demo test set, [subjective_demo.xlsx](https://opencompass.openxlab.space/utils/subjective_demo.xlsx), based on [z-bench](https://github.com/zhenbench/z-bench).

Store the set of subjective questions as an .xlsx file in the `data/subjective/` directory.

The table includes the following fields:

- 'question': Question description
- 'index': Question number
- 'reference_answer': Reference answer
- 'evaluating_guidance': Evaluation guidance
- 'capability': The capability dimension of the question
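
If you build your own question set, the following is a minimal sketch of producing such a file (the file name, example row, and the use of pandas with an xlsx writer such as openpyxl are illustrative assumptions, not OpenCompass requirements):

```python
# Illustrative sketch: build a question table with the fields listed above
# and save it under data/subjective/ in .xlsx format (requires pandas + openpyxl).
import pandas as pd

questions = pd.DataFrame([
    {
        'index': 0,
        'question': 'Write a short poem about autumn.',
        'reference_answer': '',  # assumed it may stay empty for open-ended questions
        'evaluating_guidance': 'Prefer vivid imagery and fluent language.',
        'capability': 'common',
    },
])
questions.to_excel('data/subjective/my_subjective_set.xlsx', index=False)
```
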
## Evaluation Configuration

The overall process consists of:

1. Model response inference
2. GPT-4 pairwise comparison
3. Generating the evaluation report

For `config/subjective.py`, we provide annotations to help users understand the meaning of each part of the configuration file.

```python
# Import datasets and subjective evaluation summarizer
from mmengine.config import read_base
with read_base():
    from .datasets.subjective_cmp.subjective_cmp import subjective_datasets
    from .summarizers.subjective import summarizer

datasets = [*subjective_datasets]

from opencompass.models import HuggingFaceCausalLM, HuggingFace, OpenAI

# Import partitioner and task required for subjective evaluation
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask

# Define model configurations for inference and evaluation,
# including the inference models chatglm2-6b, qwen-7b-chat, internlm-chat-7b, and the evaluation model gpt4
models = [...]

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[
        dict(role='SYSTEM', api_role='SYSTEM'),
    ],
)

# Define the configuration for subjective evaluation
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='all',  # alternately constructs two for comparisons
    ),
    runner=dict(
        type=LocalRunner,
        max_num_workers=2,  # supports parallel comparisons
        task=dict(
            type=SubjectiveEvalTask,  # used to read inputs for a pair of models
            judge_cfg=dict(
                abbr='GPT4',
                type=OpenAI,
                path='gpt-4-0613',
                key='ENV',
                meta_template=api_meta_template,
                query_per_second=1,
                max_out_len=2048,
                max_seq_len=2048,
                batch_size=2),
        )),
)
```
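
The `models` list above is left elided. Purely as an illustration of what one inference-model entry could look like (the model name, paths, and argument values below are hypothetical, and the exact argument names may vary across OpenCompass versions; see the configs shipped under `configs/models/` for authoritative examples):

```python
# Hypothetical example of a single inference-model entry for `models`.
example_model = dict(
    type=HuggingFaceCausalLM,               # imported in the config above
    abbr='my-chat-7b-hf',                   # hypothetical display name
    path='my-org/my-chat-7b',               # hypothetical HuggingFace model id
    tokenizer_path='my-org/my-chat-7b',
    max_out_len=1024,
    max_seq_len=2048,
    batch_size=8,
    run_cfg=dict(num_gpus=1, num_procs=1),
)
```
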

## Launching the Evaluation

```shell
python run.py config/subjective.py -r
```

The `-r` flag allows reusing model inference and GPT-4 evaluation results from a previous run, so completed steps are not re-executed.

## Evaluation Report

The evaluation report will be output to `output/.../summary/timestamp/report.md`, which includes win rate statistics, battle scores, and ELO ratings. The specific format is as follows:

```markdown
# Subjective Analysis

A total of 30 comparisons, of which 30 comparisons are meaningful (A / B answers inconsistent)
A total of 30 answer comparisons, successfully extracted 30 answers from GPT-4 replies, with an extraction success rate of 100.00%

### Basic statistics (4 stats: win / tie / lose / not bad)

| Dimension \ Stat [W / T / L / NB] | chatglm2-6b-hf                | qwen-7b-chat-hf              | internlm-chat-7b-hf           |
| --------------------------------- | ----------------------------- | ---------------------------- | ----------------------------- |
| LANG: Overall                     | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: CN                          | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: EN                          | N/A                           | N/A                          | N/A                           |
| CAPA: common                      | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |

### Model scores (base score is 0, win +3, both +1, neither -1, lose -3)

| Dimension \ Score | chatglm2-6b-hf | qwen-7b-chat-hf | internlm-chat-7b-hf |
| ----------------- | -------------- | --------------- | ------------------- |
| LANG: Overall     | -8             | 0               | -8                  |
| LANG: CN          | -8             | 0               | -8                  |
| LANG: EN          | N/A            | N/A             | N/A                 |
| CAPA: common      | -8             | 0               | -8                  |

### Bootstrap ELO, Median of n=1000 times

|                  | chatglm2-6b-hf | internlm-chat-7b-hf | qwen-7b-chat-hf |
| ---------------- | -------------- | ------------------- | --------------- |
| elo_score [Mean] | 999.504        | 999.912             | 1000.26         |
| elo_score [Std]  | 0.621362       | 0.400226            | 0.694434        |
```

When comparing models A and B, the judge has four choices:

1. A is better than B.
2. A and B are equally good.
3. A is worse than B.
4. Neither A nor B is good.

So `win` / `tie` / `lose` / `not bad` denote the proportions of comparisons in which the model wins / ties / loses / either wins or is judged equally good, respectively.
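
As a minimal sketch of the battle-score tally shown in the "Model scores" table (the judgment labels `'win'`, `'both'`, `'neither'`, `'lose'` below are illustrative, not necessarily the labels OpenCompass uses internally):

```python
# Starting from a base score of 0: win +3, "both good" +1, "neither good" -1, lose -3.
SCORE_DELTA = {'win': 3, 'both': 1, 'neither': -1, 'lose': -3}

def battle_score(judgments):
    """Sum the score deltas for one model across all of its comparisons."""
    return sum(SCORE_DELTA[j] for j in judgments)

# e.g. battle_score(['lose', 'both', 'neither', 'lose']) == -6
```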

`Bootstrap ELO` is the median ELO score obtained by replaying the match results in 1000 random permutations.
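
A minimal sketch of this bootstrap procedure (not the exact OpenCompass implementation; the K-factor, base rating, and match format are assumptions):

```python
# Replay the same matches in many random orders and take each model's median rating.
import random
from collections import defaultdict

def elo_ratings(matches, k=32, base=1000.0):
    """matches: list of (model_a, model_b, winner), with winner in {'A', 'B', 'tie'}."""
    ratings = defaultdict(lambda: base)
    for a, b, winner in matches:
        expected_a = 1 / (1 + 10 ** ((ratings[b] - ratings[a]) / 400))
        score_a = 1.0 if winner == 'A' else 0.0 if winner == 'B' else 0.5
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1 - score_a) - (1 - expected_a))
    return dict(ratings)

def bootstrap_elo(matches, n=1000, seed=0):
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n):
        shuffled = matches[:]
        rng.shuffle(shuffled)  # permute the match order
        for model, rating in elo_ratings(shuffled).items():
            samples[model].append(rating)
    # Median rating per model over all permutations.
    return {m: sorted(r)[len(r) // 2] for m, r in samples.items()}
```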