Subjective evaluation aims to assess a model's performance on tasks that align with human preferences. The gold standard for such evaluation is human preference, but collecting human annotations is costly.
To explore the model's subjective capabilities, we employ JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).
`config/eval_subjective_score.py` is largely the same as `config/eval_subjective_compare.py`; the only change needed is setting the eval mode to `singlescore`, as sketched below.
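For illustration, here is a minimal sketch of how the `singlescore` mode is typically wired up in such a config. The class names and fields used here (`SubjectiveNaivePartitioner`, `SubjectiveEvalTask`, the `mode` and `judge_cfg` arguments, and the placeholder `models` / `judge_model` variables) are assumptions based on one OpenCompass release and may differ in yours; treat the `config/eval_subjective_score.py` shipped with the repo as the authoritative reference.

```python
# Sketch only: names follow one OpenCompass release and may differ in yours.
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask

models = [...]        # models under test, defined earlier in the config (elided here)
judge_model = dict()  # the JudgeLLM config, e.g. a GPT-4 API model (elided here)

eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='singlescore',  # score each model on its own; compare configs use a pairwise mode instead
        models=models,
    ),
    runner=dict(
        type=LocalRunner,
        max_num_workers=2,
        task=dict(type=SubjectiveEvalTask, judge_cfg=judge_model),
    ),
)
```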
3. PandaLM, refer to `configs/models/judge_llm/pandalm`
Consider citing the following paper if you find it helpful:
```bibtex
@article{wang2023pandalm,
title={PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization},
author={Wang, Yidong and Yu, Zhuohao and Zeng, Zhengran and Yang, Linyi and Wang, Cunxiang and Chen, Hao and Jiang, Chaoya and Xie, Rui and Wang, Jindong and Xie, Xing and others},
journal={arXiv preprint arXiv:2306.05087},
year={2023}
}
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished={\url{https://github.com/open-compass/opencompass}},
year={2023}
}
```
## Multi-round Subjective Evaluation in OpenCompass
In OpenCompass, we also support subjective multi-turn dialogue evaluation; for example, the MT-Bench evaluation can be found in `configs/eval_subjective_mtbench.py`.
For multi-turn dialogue evaluation, you need to organize the data into the following dialogue structure:
```
"dialogue": [
    {
        "role": "user",
        "content": "Imagine you are participating in a race with a group of people. If you have just overtaken the second person, what's your current position? Where is the person you just overtook?"
    },
    {
        "role": "assistant",
        "content": ""
    },
    {
        "role": "user",
        "content": "If the \"second person\" is changed to \"last person\" in the above question, what would the answer be?"
    },
    {
        "role": "assistant",
        "content": ""
    }
],
```
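For reference, the snippet below shows one way to pack raw user questions into this structure in Python. The `build_dialogue` helper and the surrounding script are purely illustrative (not part of OpenCompass); only the `role`/`content` layout comes from the format above.

```python
import json


def build_dialogue(user_turns):
    """Interleave user turns with empty assistant slots; the assistant
    `content` is left empty and is filled in turn by turn at inference time."""
    dialogue = []
    for turn in user_turns:
        dialogue.append({"role": "user", "content": turn})
        dialogue.append({"role": "assistant", "content": ""})
    return dialogue


sample = {
    "dialogue": build_dialogue([
        "Imagine you are participating in a race with a group of people. "
        "If you have just overtaken the second person, what's your current position? "
        "Where is the person you just overtook?",
        'If the "second person" is changed to "last person" in the above question, '
        "what would the answer be?",
    ])
}
print(json.dumps(sample, indent=2, ensure_ascii=False))
```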
It's important to note that, because the different question types in MT-Bench use different temperature settings, we need to split the original data file into three subsets according to temperature and run inference on each subset separately, so that each subset can be assigned its own temperature. For the specific settings, please refer to `configs/datasets/subjective/multiround/mtbench_single_judge_diff_temp.py`.
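To make the split concrete, here is a hypothetical helper (not part of OpenCompass) that groups raw MT-Bench samples into temperature-based subsets. The category-to-temperature mapping follows the original MT-Bench setup (0.7 for writing/roleplay, 0.1 for STEM/humanities, 0.0 for the rest); verify it against the config file referenced above.

```python
from collections import defaultdict

# Category -> sampling temperature, following the original MT-Bench setup;
# double-check against mtbench_single_judge_diff_temp.py for your version.
TEMPERATURE_BY_CATEGORY = {
    "writing": 0.7, "roleplay": 0.7,
    "stem": 0.1, "humanities": 0.1,
    "reasoning": 0.0, "math": 0.0, "coding": 0.0, "extraction": 0.0,
}


def split_by_temperature(samples):
    """Group MT-Bench samples (each with a 'category' field) into subsets
    that share a single sampling temperature."""
    subsets = defaultdict(list)
    for sample in samples:
        subsets[TEMPERATURE_BY_CATEGORY[sample["category"]]].append(sample)
    return subsets  # e.g. {0.7: [...], 0.1: [...], 0.0: [...]}
```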
Consider citing the following paper if you find it helpful:
```bibtex
@misc{zheng2023judging,
title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
year={2023},
eprint={2306.05685},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished={\url{https://github.com/open-compass/opencompass}},
year={2023}
}
```