
# Subjective Evaluation Guidance
## Introduction
Subjective evaluation aims to assess a model's performance on tasks that hinge on alignment with human preferences. The gold standard for such evaluation is human preference, but collecting human annotations is costly.
To explore a model's subjective capabilities at lower cost, we employ a state-of-the-art LLM (GPT-4) as a substitute for human assessors (LLM-as-a-Judge).
A popular evaluation method is to compare model responses pairwise and compute their win rate, as done in Chatbot Arena.
We support GPT-4-based subjective evaluation of models following this method.
## Data Preparation
We provide a demo test set, `subjective_demo.xlsx`, based on z-bench.
Store the set of subjective questions in `.xlsx` format in the `data/subjective/` directory.
The table includes the following fields (a sketch for generating such a file follows the list):
- 'question': Question description
- 'index': Question number
- 'reference_answer': Reference answer
- 'evaluating_guidance': Evaluation guidance
- 'capability': The capability dimension of the question.
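As a minimal sketch (assuming pandas with an xlsx writer such as openpyxl is installed), a compatible question file could be generated as follows; the example row is purely illustrative, and only the column names matter:

```python
import pandas as pd

# Purely illustrative rows; the column names must match the fields listed above.
rows = [
    dict(
        question='Write a short poem about autumn.',
        index=1,
        reference_answer='Any coherent four-line poem about autumn.',
        evaluating_guidance='Judge fluency, imagery, and topical relevance.',
        capability='common',
    ),
]

# Requires an xlsx engine such as openpyxl.
pd.DataFrame(rows).to_excel('data/subjective/subjective_demo.xlsx', index=False)
```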
## Evaluation Configuration
The specific process includes:
- Model response inference
- Pairwise evaluation by GPT-4
- Generating evaluation reports
In `config/subjective.py`, we provide annotations to help users understand the meaning of the configuration file:
```python
# Import datasets and the subjective evaluation summarizer
from mmengine.config import read_base
with read_base():
    from .datasets.subjective_cmp.subjective_cmp import subjective_datasets
    from .summarizers.subjective import summarizer

datasets = [*subjective_datasets]

from opencompass.models import HuggingFaceCausalLM, HuggingFace, OpenAI

# Import the partitioner and task required for subjective evaluation
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask

# Define model configurations for inference and evaluation,
# including the inference models chatglm2-6b, qwen-7b-chat, internlm-chat-7b,
# and the evaluation model gpt4
models = [...]

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[
        dict(role='SYSTEM', api_role='SYSTEM'),
    ],
)

# Define the configuration for subjective evaluation
eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='all',  # alternately constructs two for comparisons
    ),
    runner=dict(
        type=LocalRunner,
        max_num_workers=2,  # supports parallel comparisons
        task=dict(
            type=SubjectiveEvalTask,  # used to read inputs for a pair of models
            judge_cfg=dict(
                abbr='GPT4',
                type=OpenAI,
                path='gpt-4-0613',
                key='ENV',
                meta_template=api_meta_template,
                query_per_second=1,
                max_out_len=2048,
                max_seq_len=2048,
                batch_size=2),
        )),
)
```
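The `models = [...]` placeholder above is filled with ordinary OpenCompass model configs. As a hedged sketch, one entry for an inference model might look like the following; the field names follow the stock OpenCompass HuggingFace model configs, while the paths, lengths, and batch size are illustrative placeholders to adapt to your environment:

```python
from opencompass.models import HuggingFaceCausalLM

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='chatglm2-6b-hf',             # name that appears in the report tables
        path='THUDM/chatglm2-6b',          # HF model path (placeholder)
        tokenizer_path='THUDM/chatglm2-6b',
        tokenizer_kwargs=dict(trust_remote_code=True),
        model_kwargs=dict(trust_remote_code=True, device_map='auto'),
        max_out_len=512,                   # illustrative generation settings
        max_seq_len=2048,
        batch_size=8,
        run_cfg=dict(num_gpus=1, num_procs=1),
    ),
    # ... add qwen-7b-chat, internlm-chat-7b, etc. in the same way
]
```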
## Launching the Evaluation
```bash
python run.py config/subjective.py -r
```
The `-r` parameter allows reusing existing model inference and GPT-4 evaluation results.
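Note that `judge_cfg` above sets `key='ENV'`, in which case the OpenAI API key is read from the `OPENAI_API_KEY` environment variable, so make sure it is set before launching the evaluation.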
## Evaluation Report
The evaluation report will be output to `output/.../summary/timestamp/report.md`, which includes win rate statistics, battle scores, and ELO ratings. The specific format is as follows:
```markdown
# Subjective Analysis
A total of 30 comparisons, of which 30 comparisons are meaningful (A / B answers inconsistent)
A total of 30 answer comparisons, successfully extracted 30 answers from GPT-4 replies, with an extraction success rate of 100.00%
### Basic statistics (4 stats: win / tie / lose / not bad)
| Dimension \ Stat [W / T / L / NB] | chatglm2-6b-hf | qwen-7b-chat-hf | internlm-chat-7b-hf |
| --------------------------------- | ----------------------------- | ---------------------------- | ----------------------------- |
| LANG: Overall | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: CN | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |
| LANG: EN | N/A | N/A | N/A |
| CAPA: common | 30.0% / 40.0% / 30.0% / 30.0% | 50.0% / 0.0% / 50.0% / 50.0% | 30.0% / 40.0% / 30.0% / 30.0% |


### Model scores (base score is 0, win +3, both +1, neither -1, lose -3)
| Dimension \ Score | chatglm2-6b-hf | qwen-7b-chat-hf | internlm-chat-7b-hf |
| ----------------- | -------------- | --------------- | ------------------- |
| LANG: Overall | -8 | 0 | -8 |
| LANG: CN | -8 | 0 | -8 |
| LANG: EN | N/A | N/A | N/A |
| CAPA: common | -8 | 0 | -8 |
### Bootstrap ELO, Median of n=1000 times
| | chatglm2-6b-hf | internlm-chat-7b-hf | qwen-7b-chat-hf |
| ---------------- | -------------- | ------------------- | --------------- |
| elo_score [Mean] | 999.504 | 999.912 | 1000.26 |
| elo_score [Std] | 0.621362 | 0.400226 | 0.694434 |
```
When comparing the responses of models A and B, there are four possible judgments:
- A is better than B.
- A and B are equally good.
- A is worse than B.
- Neither A nor B is good.
Thus, `win` / `tie` / `lose` / `not bad` represent the proportions of comparisons in which the model wins / ties / loses / wins or is judged equally good, respectively.
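As a minimal sketch of how such proportions can be derived (the verdict labels and the lumping of "neither is good" into `tie` are illustrative assumptions, not necessarily how OpenCompass aggregates them):

```python
from collections import Counter

def basic_stats(verdicts):
    """Win / tie / lose / not-bad proportions for model A, given verdicts in
    {'A_better', 'equally_good', 'B_better', 'neither_good'}."""
    c = Counter(verdicts)
    n = len(verdicts)
    win = c['A_better'] / n
    tie = (c['equally_good'] + c['neither_good']) / n  # assumption: both count as a tie
    lose = c['B_better'] / n
    not_bad = (c['A_better'] + c['equally_good']) / n  # win, or at least as good as B
    return dict(win=win, tie=tie, lose=lose, not_bad=not_bad)
```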
`Bootstrap ELO` is calculated as the median ELO score over 1000 random permutations of the match results.
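For intuition, here is a self-contained sketch (not the OpenCompass implementation) of such a bootstrap ELO estimate: run a sequential ELO update over many random permutations of the match results and take the per-model median. The K-factor of 32 and initial rating of 1000 are illustrative assumptions.

```python
import random
import statistics
from collections import defaultdict

def elo_from_matches(matches, k=32, init=1000.0):
    """Sequential ELO update over (model_a, model_b, winner) tuples,
    where winner is 'A', 'B', or 'tie'."""
    ratings = defaultdict(lambda: init)
    for a, b, winner in matches:
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
        score_a = 1.0 if winner == 'A' else 0.0 if winner == 'B' else 0.5
        ratings[a] += k * (score_a - expected_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - expected_a))
    return dict(ratings)

def bootstrap_elo(matches, n=1000, seed=0):
    """Median ELO per model over n random permutations of the match order."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n):
        shuffled = list(matches)
        rng.shuffle(shuffled)
        for model, score in elo_from_matches(shuffled).items():
            samples[model].append(score)
    return {model: statistics.median(scores) for model, scores in samples.items()}

# Example usage with hypothetical results:
# bootstrap_elo([('chatglm2-6b-hf', 'qwen-7b-chat-hf', 'B'), ...])
```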