
# Subjective Evaluation Guidance
## Introduction
Subjective evaluation aims to assess a model's performance on tasks that hinge on alignment with human preferences. The gold standard for such evaluation is human preference, but collecting human annotations is costly.
To explore a model's subjective capabilities at lower cost, we employ a state-of-the-art LLM (GPT-4) as a substitute for human assessors (LLM-as-a-Judge).
A popular evaluation method is to compare model responses pairwise and compute their win rate, as done in Chatbot Arena.
We support GPT-4-based subjective evaluation of models following this method.
## Data Preparation
We provide a demo test set, `subjective_demo.xlsx`, based on z-bench.
Store the set of subjective questions in `.xlsx` format in the `data/subjective/` directory.
The table includes the following fields (a sketch for generating such a file follows the list):
- 'question': Question description
- 'index': Question number
- 'reference_answer': Reference answer
- 'evaluating_guidance': Evaluation guidance
- 'capability': The capability dimension of the question.
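As a minimal sketch (assuming pandas with an xlsx writer such as openpyxl is installed), a compatible question file could be generated as follows; the example row is purely illustrative, and only the column names matter:

```python
import pandas as pd

# Purely illustrative rows; the column names must match the fields listed above.
rows = [
    dict(
        question='Write a short poem about autumn.',
        index=1,
        reference_answer='Any coherent four-line poem about autumn.',
        evaluating_guidance='Judge fluency, imagery, and topical relevance.',
        capability='common',
    ),
]

# Requires an xlsx engine such as openpyxl.
pd.DataFrame(rows).to_excel('data/subjective/subjective_demo.xlsx', index=False)
```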
## Evaluation Configuration
The specific process includes:
- Model response inference
- Pairwise evaluation by GPT-4
- Generating evaluation reports
In `config/subjective.py`, we provide annotations to help users understand the meaning of the configuration file:
```python
# Import datasets and the subjective evaluation summarizer
from mmengine.config import read_base
with read_base():
    from .datasets.subjective_cmp.subjective_cmp import subjective_datasets
    from .summarizers.subjective import summarizer

datasets = [*subjective_datasets]

from opencompass.models import HuggingFaceCausalLM, HuggingFace, OpenAI

# Import the partitioner and task required for subjective evaluation
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask

# Define model configurations for inference and evaluation,
# including the inference models chatglm2-6b, qwen-7b-chat, internlm-chat-7b,
# and the evaluation model gpt4
models = [...]

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[
        dict(role='SYSTEM', api_role='SYSTEM'),
    ],
)

# Define the configuration for subjective evaluation
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='all',  # alternately constructs two for comparisons
    ),
    runner=dict(
        type=LocalRunner,
        max_num_workers=2,  # supports parallel comparisons
        task=dict(
            type=SubjectiveEvalTask,  # used to read inputs for a pair of models
            judge_cfg=dict(
                abbr='GPT4',
                type=OpenAI,
                path='gpt-4-0613',
                key='ENV',
                meta_template=api_meta_template,
                query_per_second=1,
                max_out_len=2048,
                max_seq_len=2048,
                batch_size=2),
        )),
)
```
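The `models = [...]` placeholder above is filled with ordinary OpenCompass model configs. As a hedged sketch, one entry for an inference model might look like the following; the field names follow the stock OpenCompass HuggingFace model configs, while the paths, lengths, and batch size are illustrative placeholders to adapt to your environment:

```python
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='chatglm2-6b-hf',             # name that appears in the report tables
        path='THUDM/chatglm2-6b',          # HF model path (placeholder)
        tokenizer_path='THUDM/chatglm2-6b',
        tokenizer_kwargs=dict(trust_remote_code=True),
        model_kwargs=dict(trust_remote_code=True, device_map='auto'),
        max_out_len=512,                   # illustrative generation settings
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    ),
    # ... add qwen-7b-chat, internlm-chat-7b, etc. in the same way
]
```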
## Launching the Evaluation
```bash
python run.py config/subjective.py -r
```
The `-r` parameter allows reusing existing model inference and GPT-4 evaluation results.
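Note that `judge_cfg` above sets `key='ENV'`, in which case the OpenAI API key is read from the `OPENAI_API_KEY` environment variable, so make sure it is set before launching the evaluation.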
## Evaluation Report
The evaluation report will be output to `output/.../summary/timestamp/report.md`, which includes win rate statistics, battle scores, and ELO ratings. The specific format is as follows:
```markdown
# Subjective Analysis
A total of 30 comparisons, of which 30 comparisons are meaningful (A / B answers inconsistent)
A total of 30 answer comparisons, successfully extracted 30 answers from GPT-4 replies, with an extraction success rate of 100.00%
### Basic statistics (4 stats: win / tie / lose / not bad)
| Dimension \ Stat [W / T / L / NB] | chatglm2-6b-hf | qwen-7b-chat-hf | internlm-chat-7b-hf |
| --------------------------------- | ----------------------------- | ---------------------------- | ----------------------------- |
| LANG: Overall | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: CN | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: EN | N/A | N/A | N/A |
| CAPA: common | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |


### Model scores (base score is 0, win +3, both +1, neither -1, lose -3)
| Dimension \ Score | chatglm2-6b-hf | qwen-7b-chat-hf | internlm-chat-7b-hf |
| ----------------- | -------------- | --------------- | ------------------- |
| LANG: Overall | -8 | 0 | -8 |
| LANG: CN | -8 | 0 | -8 |
| LANG: EN | N/A | N/A | N/A |
| CAPA: common | -8 | 0 | -8 |
### Bootstrap ELO, Median of n=1000 times
| | chatglm2-6b-hf | internlm-chat-7b-hf | qwen-7b-chat-hf |
| ---------------- | -------------- | ------------------- | --------------- |
| elo_score [Mean] | 999.504 | 999.912 | 1000.26 |
| elo_score [Std] | 0.621362 | 0.400226 | 0.694434 |
```
When comparing the responses of models A and B, there are four possible judgments:
- A is better than B.
- A and B are equally good.
- A is worse than B.
- Neither A nor B is good.
Thus, `win` / `tie` / `lose` / `not bad` represent the proportions of comparisons in which the model wins / ties / loses / wins or is judged equally good, respectively.
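As a minimal sketch of how such proportions can be derived (the verdict labels and the lumping of "neither is good" into `tie` are illustrative assumptions, not necessarily how OpenCompass aggregates them):

```python
from collections import Counter

def basic_stats(verdicts):
    """Win / tie / lose / not-bad proportions for model A, given verdicts in
    {'A_better', 'equally_good', 'B_better', 'neither_good'}."""
    c = Counter(verdicts)
    n = len(verdicts)
    win = c['A_better'] / n
    tie = (c['equally_good'] + c['neither_good']) / n  # assumption: both count as a tie
    lose = c['B_better'] / n
    not_bad = (c['A_better'] + c['equally_good']) / n  # win, or at least as good as B
    return dict(win=win, tie=tie, lose=lose, not_bad=not_bad)
```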
`Bootstrap ELO` is calculated as the median ELO score over 1000 random permutations of the match results.
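For intuition, here is a self-contained sketch (not the OpenCompass implementation) of such a bootstrap ELO estimate: run a sequential ELO update over many random permutations of the match results and take the per-model median. The K-factor of 32 and initial rating of 1000 are illustrative assumptions.

```python
import random
import statistics
from collections import defaultdict

def elo_from_matches(matches, k=32, init=1000.0):
    """Sequential ELO update over (model_a, model_b, winner) tuples,
    where winner is 'A', 'B', or 'tie'."""
    ratings = defaultdict(lambda: init)
    for a, b, winner in matches:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
        score_a = 1.0 if winner == 'A' else 0.0 if winner == 'B' else 0.5
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

def bootstrap_elo(matches, n=1000, seed=0):
    """Median ELO per model over n random permutations of the match order."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n):
        shuffled = list(matches)
        rng.shuffle(shuffled)
        for model, score in elo_from_matches(shuffled).items():
            samples[model].append(score)
    return {model: statistics.median(scores) for model, scores in samples.items()}

# Example usage with hypothetical results:
# bootstrap_elo([('chatglm2-6b-hf', 'qwen-7b-chat-hf', 'B'), ...])
```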