OpenCompass/docs/en/advanced_guides/subjective_evaluation.md

# Subjective Evaluation Guidance

## Introduction

Subjective evaluation aims to assess the model's performance in tasks that align with human preferences. The key criterion for this evaluation is human preference, but it comes with a high cost of annotation.

To explore the model's subjective capabilities, we employ JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).

A popular evaluation method involves

- Compare Mode: comparing model responses pairwise to calculate their win rate
- Score Mode: another method involves calculate scores with single model response ([Chatbot Arena](https://chat.lmsys.org/)).

We support the use of GPT-4 (or other JudgeLLM) for the subjective evaluation of models based on above methods.

## Currently Supported Subjective Evaluation Datasets

1. AlignBench Chinese Scoring Dataset (https://github.com/THUDM/AlignBench)
2. MTBench English Scoring Dataset, two-turn dialogue (https://github.com/lm-sys/FastChat)
3. MTBench101 English Scoring Dataset, multi-turn dialogue (https://github.com/mtbench101/mt-bench-101)
4. AlpacaEvalv2 English Compare Dataset (https://github.com/tatsu-lab/alpaca_eval)
5. ArenaHard English Compare Dataset, mainly focused on coding (https://github.com/lm-sys/arena-hard/tree/main)
6. Fofo English Scoring Dataset (https://github.com/SalesforceAIResearch/FoFo/)
7. Wildbench English Score and Compare Dataset（https://github.com/allenai/WildBench）

## Initiating Subjective Evaluation

Similar to existing objective evaluation methods, you can configure related settings in `configs/eval_subjective.py`.

### Basic Parameters: Specifying models, datasets, and judgemodels

Similar to objective evaluation, import the models and datasets that need to be evaluated, for example:

```
with read_base():
    from .datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets
    from .datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4 import subjective_datasets as alpacav2
    from .models.qwen.hf_qwen_7b import models
```

It is worth noting that since the model setup parameters for subjective evaluation are often different from those for objective evaluation, it often requires setting up `do_sample` for inference instead of `greedy`. You can modify the relevant parameters in the configuration file as needed, for example:

```
models = [
    dict(
        type=HuggingFaceChatGLM3,
        abbr='chatglm3-6b-hf2',
        path='THUDM/chatglm3-6b',
        tokenizer_path='THUDM/chatglm3-6b',
        model_kwargs=dict(
            device_map='auto',
            trust_remote_code=True,
        ),
        tokenizer_kwargs=dict(
            padding_side='left',
            truncation_side='left',
            trust_remote_code=True,
        ),
        generation_kwargs=dict(
            do_sample=True,
        ),
        meta_template=api_meta_template,
        max_out_len=2048,
        max_seq_len=4096,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```

The judgemodel is usually set to a powerful model like GPT4, and you can directly enter your API key according to the configuration in the config file, or use a custom model as the judgemodel.

### Specifying Other Parameters

In addition to the basic parameters, you can also modify the `infer` and `eval` fields in the config to set a more appropriate partitioning method. The currently supported partitioning methods mainly include three types: NaivePartitioner, SizePartitioner, and NumberWorkPartitioner. You can also specify your own workdir to save related files.

## Subjective Evaluation with Custom Dataset

The specific process includes:

1. Data preparation
2. Model response generation
3. Evaluate the response with a JudgeLLM
4. Generate JudgeLLM's response and calculate the metric

### Step-1: Data Preparation

This step requires preparing the dataset file and implementing your own dataset class under `Opencompass/datasets/subjective/`, returning the read data in the format of `list of dict`.

Actually, you can prepare the data in any format you like (csv, json, jsonl, etc.). However, to make it easier to get started, it is recommended to construct the data according to the format of the existing subjective datasets or according to the following json format.
We provide mini test-set for **Compare Mode** and **Score Mode** as below:

```python
###COREV2
[
    {
        "question": "如果我在空中垂直抛球，球最初向哪个方向行进？",
        "capability": "知识-社会常识",
        "others": {
            "question": "如果我在空中垂直抛球，球最初向哪个方向行进？",
            "evaluating_guidance": "",
            "reference_answer": "上"
        }
    },...]

###CreationV0.1
[
    {
        "question": "请你扮演一个邮件管家，我让你给谁发送什么主题的邮件，你就帮我扩充好邮件正文，并打印在聊天框里。你需要根据我提供的邮件收件人以及邮件主题，来斟酌用词，并使用合适的敬语。现在请给导师发送邮件，询问他是否可以下周三下午15:00进行科研同步会，大约200字。",
        "capability": "邮件通知",
        "others": ""
    },
```

The json must includes the following fields:

- 'question': Question description
- 'capability': The capability dimension of the question.
- 'others': Other needed information.

If you want to modify prompt on each single question, you can full some other information into 'others' and construct it.

### Step-2: Evaluation Configuration(Compare Mode)

Taking Alignbench as an example, `configs/datasets/subjective/alignbench/alignbench_judgeby_critiquellm.py`:

1. First, you need to set `subjective_reader_cfg` to receive the relevant fields returned from the custom Dataset class and specify the output fields when saving files.
2. Then, you need to specify the root path `data_path` of the dataset and the dataset filename `subjective_all_sets`. If there are multiple sub-files, you can add them to this list.
3. Specify `subjective_infer_cfg` and `subjective_eval_cfg` to configure the corresponding inference and evaluation prompts.
4. Specify additional information such as `mode` at the corresponding location. Note that the fields required for different subjective datasets may vary.
5. Define post-processing and score statistics. For example, the postprocessing function `alignbench_postprocess` located under `opencompass/opencompass/datasets/subjective/alignbench`.

### Step-3: Launch the Evaluation

```shell
python run.py config/eval_subjective_score.py -r
```

The `-r` parameter allows the reuse of model inference and GPT-4 evaluation results.

The response of JudgeLLM will be output to `output/.../results/timestamp/xxmodel/xxdataset/.json`.
The evaluation report will be output to `output/.../summary/timestamp/report.csv`.

## Multi-round Subjective Evaluation in OpenCompass

In OpenCompass, we also support subjective multi-turn dialogue evaluation. For instance, the evaluation of MT-Bench can be referred to in `configs/datasets/subjective/multiround`.

In the multi-turn dialogue evaluation, you need to organize the data format into the following dialogue structure:

```
"dialogue": [
    {
        "role": "user",
        "content": "Imagine you are participating in a race with a group of people. If you have just overtaken the second person, what's your current position? Where is the person you just overtook?"
    },
    {
        "role": "assistant",
        "content": ""
    },
    {
        "role": "user",
        "content": "If the \"second person\" is changed to \"last person\" in the above question, what would the answer be?"
    },
    {
        "role": "assistant",
        "content": ""
    }
],
```

It's important to note that due to the different question types in MTBench having different temperature settings, we need to divide the original data files into three different subsets according to the temperature for separate inference. For different subsets, we can set different temperatures. For specific settings, please refer to `configs\datasets\subjective\multiround\mtbench_single_judge_diff_temp.py`.
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
+								# Subjective Evaluation Guidance
 								## Introduction
 								Subjective evaluation aims to assess the model's performance in tasks that align with human preferences. The key criterion for this evaluation is human preference, but it comes with a high cost of annotation.
-												[Feature] Add Subjective Evaluation (#680)

* new version of subject

* fixed draw

* fixed draw

* fixed draw

* done

* done

* done

* done

* fixed lint
											
										
										
											2023-12-11 22:22:11 +08:00
+								To explore the model's subjective capabilities, we employ JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
-												[Doc] Update Doc for Alignbench (#707)

* update alignmentbench

* update alignmentbench

* update doc

* update

* update
											
										
										
											2023-12-15 15:07:25 +08:00
+								A popular evaluation method involves
 								- Compare Mode: comparing model responses pairwise to calculate their win rate
 								- Score Mode: another method involves calculate scores with single model response ([Chatbot Arena](https://chat.lmsys.org/)).
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
-												[Feature] Add Subjective Evaluation (#680)

* new version of subject

* fixed draw

* fixed draw

* fixed draw

* done

* done

* done

* done

* fixed lint
											
										
										
											2023-12-11 22:22:11 +08:00
+								We support the use of GPT-4 (or other JudgeLLM) for the subjective evaluation of models based on above methods.
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
-												[Refactor] Reorganize subjective eval (#1284)

* fix pip version

* fix pip version

* reorganize subjective eval

* reorg sub

* reorg subeval

* reorg subeval

* update subjective doc

* reorg subeval

* reorg subeval
											
										
										
											2024-07-05 22:11:37 +08:00
+								## Currently Supported Subjective Evaluation Datasets
-												[Feature] Support AlpacaEval_V2 (#1006)

* support alpacaeval_v2

* support alpacaeval

* update docs

* update docs
											
										
										
											2024-03-28 16:49:04 +08:00
-												[Refactor] Reorganize subjective eval (#1284)

* fix pip version

* fix pip version

* reorganize subjective eval

* reorg sub

* reorg subeval

* reorg subeval

* update subjective doc

* reorg subeval

* reorg subeval
											
										
										
											2024-07-05 22:11:37 +08:00
+. AlignBench Chinese Scoring Dataset (https://github.com/THUDM/AlignBench)
 . MTBench English Scoring Dataset, two-turn dialogue (https://github.com/lm-sys/FastChat)
 . MTBench101 English Scoring Dataset, multi-turn dialogue (https://github.com/mtbench101/mt-bench-101)
 . AlpacaEvalv2 English Compare Dataset (https://github.com/tatsu-lab/alpaca_eval)
 . ArenaHard English Compare Dataset, mainly focused on coding (https://github.com/lm-sys/arena-hard/tree/main)
 . Fofo English Scoring Dataset (https://github.com/SalesforceAIResearch/FoFo/)
-												[Fix] minor update wildbench (#1335)

* update crb

* update crbbench

* update crbbench

* update crbbench

* minor update wildbench

* [Fix] Update doc of wildbench, and merge wildbench into subjective

* [Fix] Update doc of wildbench, and merge wildbench into subjective, fix crbbench

* Update crb.md

* Update crb_pair_judge.py

* Update crb_single_judge.py

* Update subjective_evaluation.md

* Update openai_api.py

* [Update] update wildbench readme

* [Update] update wildbench readme

* [Update] update wildbench readme, remove crb

* Delete configs/eval_subjective_wildbench_pair.py

* Delete configs/eval_subjective_wildbench_single.py

* Update __init__.py

---------

Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>
											
										
										
											2024-07-26 11:19:04 +08:00
+. Wildbench English Score and Compare Dataset（https://github.com/allenai/WildBench）
-												[Refactor] Reorganize subjective eval (#1284)

* fix pip version

* fix pip version

* reorganize subjective eval

* reorg sub

* reorg subeval

* reorg subeval

* update subjective doc

* reorg subeval

* reorg subeval
											
										
										
											2024-07-05 22:11:37 +08:00
 								## Initiating Subjective Evaluation
 								Similar to existing objective evaluation methods, you can configure related settings in `configs/eval_subjective.py`.
 								### Basic Parameters: Specifying models, datasets, and judgemodels
 								Similar to objective evaluation, import the models and datasets that need to be evaluated, for example:
 								```
 								with read_base():
 								    from .datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets
 								    from .datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4 import subjective_datasets as alpacav2
 								    from .models.qwen.hf_qwen_7b import models
 								```
 								It is worth noting that since the model setup parameters for subjective evaluation are often different from those for objective evaluation, it often requires setting up `do_sample` for inference instead of `greedy`. You can modify the relevant parameters in the configuration file as needed, for example:
 								```
 								models = [
 								    dict(
 								        type=HuggingFaceChatGLM3,
 								        abbr='chatglm3-6b-hf2',
 								        path='THUDM/chatglm3-6b',
 								        tokenizer_path='THUDM/chatglm3-6b',
 								        model_kwargs=dict(
 								            device_map='auto',
 								            trust_remote_code=True,
 								        ),
 								        tokenizer_kwargs=dict(
 								            padding_side='left',
 								            truncation_side='left',
 								            trust_remote_code=True,
 								        ),
 								        generation_kwargs=dict(
 								            do_sample=True,
 								        ),
 								        meta_template=api_meta_template,
 								        max_out_len=2048,
 								        max_seq_len=4096,
 								        batch_size=8,
 								        run_cfg=dict(num_gpus=1, num_procs=1),
 								    )
 								]
 								```
 								The judgemodel is usually set to a powerful model like GPT4, and you can directly enter your API key according to the configuration in the config file, or use a custom model as the judgemodel.
 								### Specifying Other Parameters
 								In addition to the basic parameters, you can also modify the `infer` and `eval` fields in the config to set a more appropriate partitioning method. The currently supported partitioning methods mainly include three types: NaivePartitioner, SizePartitioner, and NumberWorkPartitioner. You can also specify your own workdir to save related files.
-												[Feature] Support AlpacaEval_V2 (#1006)

* support alpacaeval_v2

* support alpacaeval

* update docs

* update docs
											
										
										
											2024-03-28 16:49:04 +08:00
-												[Doc] Update Doc for Alignbench (#707)

* update alignmentbench

* update alignmentbench

* update doc

* update

* update
											
										
										
											2023-12-15 15:07:25 +08:00
+								## Subjective Evaluation with Custom Dataset
 								The specific process includes:
 . Data preparation
 . Model response generation
 . Evaluate the response with a JudgeLLM
 . Generate JudgeLLM's response and calculate the metric
 								### Step-1: Data Preparation
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
-												[Refactor] Reorganize subjective eval (#1284)

* fix pip version

* fix pip version

* reorganize subjective eval

* reorg sub

* reorg subeval

* reorg subeval

* update subjective doc

* reorg subeval

* reorg subeval
											
										
										
											2024-07-05 22:11:37 +08:00
+								This step requires preparing the dataset file and implementing your own dataset class under `Opencompass/datasets/subjective/`, returning the read data in the format of `list of dict`.
 								Actually, you can prepare the data in any format you like (csv, json, jsonl, etc.). However, to make it easier to get started, it is recommended to construct the data according to the format of the existing subjective datasets or according to the following json format.
-												[Doc] Update Doc for Alignbench (#707)

* update alignmentbench

* update alignmentbench

* update doc

* update

* update
											
										
										
											2023-12-15 15:07:25 +08:00
+								We provide mini test-set for **Compare Mode** and **Score Mode** as below:
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
-												[Feature] Add Subjective Evaluation (#680)

* new version of subject

* fixed draw

* fixed draw

* fixed draw

* done

* done

* done

* done

* fixed lint
											
										
										
											2023-12-11 22:22:11 +08:00
+								```python
 								###COREV2
 								[
 								    {
 								        "question": "如果我在空中垂直抛球，球最初向哪个方向行进？",
 								        "capability": "知识-社会常识",
 								        "others": {
 								            "question": "如果我在空中垂直抛球，球最初向哪个方向行进？",
 								            "evaluating_guidance": "",
 								            "reference_answer": "上"
 								        }
 								    },...]
 								###CreationV0.1
 								[
 								    {
 								        "question": "请你扮演一个邮件管家，我让你给谁发送什么主题的邮件，你就帮我扩充好邮件正文，并打印在聊天框里。你需要根据我提供的邮件收件人以及邮件主题，来斟酌用词，并使用合适的敬语。现在请给导师发送邮件，询问他是否可以下周三下午15:00进行科研同步会，大约200字。",
 								        "capability": "邮件通知",
 								        "others": ""
 								    },
 								```
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
-												[Feature] Add Subjective Evaluation (#680)

* new version of subject

* fixed draw

* fixed draw

* fixed draw

* done

* done

* done

* done

* fixed lint
											
										
										
											2023-12-11 22:22:11 +08:00
+								The json must includes the following fields:
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
 								- 'question': Question description
 								- 'capability': The capability dimension of the question.
-												[Feature] Add Subjective Evaluation (#680)

* new version of subject

* fixed draw

* fixed draw

* fixed draw

* done

* done

* done

* done

* fixed lint
											
										
										
											2023-12-11 22:22:11 +08:00
+								- 'others': Other needed information.
 								If you want to modify prompt on each single question, you can full some other information into 'others' and construct it.
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
-												[Doc] Update Doc for Alignbench (#707)

* update alignmentbench

* update alignmentbench

* update doc

* update

* update
											
										
										
											2023-12-15 15:07:25 +08:00
+								### Step-2: Evaluation Configuration(Compare Mode)
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
-												[Refactor] Reorganize subjective eval (#1284)

* fix pip version

* fix pip version

* reorganize subjective eval

* reorg sub

* reorg subeval

* reorg subeval

* update subjective doc

* reorg subeval

* reorg subeval
											
										
										
											2024-07-05 22:11:37 +08:00
+								Taking Alignbench as an example, `configs/datasets/subjective/alignbench/alignbench_judgeby_critiquellm.py`:
-												[Feature] Add Subjective Evaluation (#680)

* new version of subject

* fixed draw

* fixed draw

* fixed draw

* done

* done

* done

* done

* fixed lint
											
										
										
											2023-12-11 22:22:11 +08:00
-												[Refactor] Reorganize subjective eval (#1284)

* fix pip version

* fix pip version

* reorganize subjective eval

* reorg sub

* reorg subeval

* reorg subeval

* update subjective doc

* reorg subeval

* reorg subeval
											
										
										
											2024-07-05 22:11:37 +08:00
+. First, you need to set `subjective_reader_cfg` to receive the relevant fields returned from the custom Dataset class and specify the output fields when saving files.
 . Then, you need to specify the root path `data_path` of the dataset and the dataset filename `subjective_all_sets`. If there are multiple sub-files, you can add them to this list.
 . Specify `subjective_infer_cfg` and `subjective_eval_cfg` to configure the corresponding inference and evaluation prompts.
-												[Update] update docs and add compassarena (#1614)

* fix pip version

* fix pip version

* update docs and add compassarena

* update docs
											
										
										
											2024-10-17 14:39:06 +08:00
+. Specify additional information such as `mode` at the corresponding location. Note that the fields required for different subjective datasets may vary.
 . Define post-processing and score statistics. For example, the postprocessing function `alignbench_postprocess` located under `opencompass/opencompass/datasets/subjective/alignbench`.
-												[Feature] Add Subjective Evaluation (#680)

* new version of subject

* fixed draw

* fixed draw

* fixed draw

* done

* done

* done

* done

* fixed lint
											
										
										
											2023-12-11 22:22:11 +08:00
-												[Doc] Update Doc for Alignbench (#707)

* update alignmentbench

* update alignmentbench

* update doc

* update

* update
											
										
										
											2023-12-15 15:07:25 +08:00
+								### Step-3: Launch the Evaluation
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
 								```shell
-												[Doc] Update Doc for Alignbench (#707)

* update alignmentbench

* update alignmentbench

* update doc

* update

* update
											
										
										
											2023-12-15 15:07:25 +08:00
+								python run.py config/eval_subjective_score.py -r
-												[Doc] Update Subjective docs (#510)

* rename

* add en subdoc

* fix name

* fix writing

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
											
										
										
											2023-10-27 16:27:24 +08:00
+								```
 								The `-r` parameter allows the reuse of model inference and GPT-4 evaluation results.
-												[Feature] Add Subjective Evaluation (#680)

* new version of subject

* fixed draw

* fixed draw

* fixed draw

* done

* done

* done

* done

* fixed lint
											
										
										
											2023-12-11 22:22:11 +08:00
+								The response of JudgeLLM will be output to `output/.../results/timestamp/xxmodel/xxdataset/.json`.
 								The evaluation report will be output to `output/.../summary/timestamp/report.csv`.
-												[Doc] Update Doc for Alignbench (#707)

* update alignmentbench

* update alignmentbench

* update doc

* update

* update
											
										
										
											2023-12-15 15:07:25 +08:00
-												[fix] add different temp for different question in mtbench (#954)

* add temp for mtbench

* add document for mtbench

* add document for mtbench
											
										
										
											2024-03-11 17:24:39 +08:00
+								## Multi-round Subjective Evaluation in OpenCompass
-												[Refactor] Reorganize subjective eval (#1284)

* fix pip version

* fix pip version

* reorganize subjective eval

* reorg sub

* reorg subeval

* reorg subeval

* update subjective doc

* reorg subeval

* reorg subeval
											
										
										
											2024-07-05 22:11:37 +08:00
+								In OpenCompass, we also support subjective multi-turn dialogue evaluation. For instance, the evaluation of MT-Bench can be referred to in `configs/datasets/subjective/multiround`.
-												[fix] add different temp for different question in mtbench (#954)

* add temp for mtbench

* add document for mtbench

* add document for mtbench
											
										
										
											2024-03-11 17:24:39 +08:00
 								In the multi-turn dialogue evaluation, you need to organize the data format into the following dialogue structure:
 								```
 								"dialogue": [
 								    {
 								        "role": "user",
 								        "content": "Imagine you are participating in a race with a group of people. If you have just overtaken the second person, what's your current position? Where is the person you just overtook?"
 								    },
 								    {
 								        "role": "assistant",
 								        "content": ""
 								    },
 								    {
 								        "role": "user",
 								        "content": "If the \"second person\" is changed to \"last person\" in the above question, what would the answer be?"
 								    },
 								    {
 								        "role": "assistant",
 								        "content": ""
 								    }
 								],
 								```
 								It's important to note that due to the different question types in MTBench having different temperature settings, we need to divide the original data files into three different subsets according to the temperature for separate inference. For different subsets, we can set different temperatures. For specific settings, please refer to `configs\datasets\subjective\multiround\mtbench_single_judge_diff_temp.py`.