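Introduction to CompassBench
CompassBench 2.0 v1.3
CompassBench (the official in-house leaderboard) has gone through several rounds of updates. Starting in July 2024, OpenCompass will publish the evaluation rules (evaluation configuration files) and sample dataset files for the in-house leaderboard, to help the community better understand its evaluation logic and methods.
Capability Dimensions
The August 2024 leaderboard will cover the following capability dimensions: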
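Capability | Task description | Evaluation method | Sample data |
---|---|---|---|
Language | Evaluates the model on tasks such as information extraction, content summarization, dialogue, and creative writing | Subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/language |
Reasoning | Evaluates the model on everyday reasoning tasks such as logical reasoning, commonsense reasoning, and table reasoning | Subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/reasoning |
Knowledge | Evaluates the model's knowledge across domains such as natural sciences, engineering, and the humanities and social sciences | Objective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/knowledge |
Math | Evaluates the model on numerical computation and on math problems of high-school and university difficulty | Objective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/math |
Code | Evaluates the model on code generation, code completion, code commenting, code refactoring, code rewriting, and general computer-knowledge Q&A | Objective + subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/code |
Instruction following | Evaluates whether the model can accurately follow complex instructions in tasks built on language, reasoning, knowledge, and similar skills | Subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/instruct |
Agent | Evaluates the model's ability to make complex tool calls and to use a code interpreter in data science, math, and similar scenarios | Objective | https://github.com/open-compass/T-Eval https://github.com/open-compass/CIBench |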
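Evaluation Method
- Objective evaluation uses a 0-shot + CoT setting.
  - OpenCompass has substantially optimized the post-processing for objective questions, and the prompt constrains the answer format during evaluation. If an answer cannot be extracted because the model failed to follow the instructions, the response is counted as incorrect.
  - Math and agent questions are of the same type as the provided sample data, but the actual evaluation data differs from the open-source data.
- Subjective evaluation is carried out by a large language model acting as the judge.
  - Every question comes with scoring guidance for the judge.
  - The win rate of the model under test is measured against a reference response, using five grades:
    - `A++`: Response A is much better than Response B.
    - `A+`: Response A is slightly better than Response B.
    - `A=B`: Response A and Response B are of the same quality.
    - `B+`: Response B is slightly better than Response A.
    - `B++`: Response B is much better than Response A.
  - Subjective evaluation configuration file
  - Subjective evaluation prompt (the English version is followed by the Chinese version)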
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the \
responses generated by two AI models.
We will provide you with the user query and a pair of AI-generated \
responses (Response A and Response B).
You should first read the user query and the conversation history \
carefully for analyzing the task, and then evaluate the quality of the \
responses based on the rules provided below.
# Conversation between User and AI
## User Query
<|begin_of_query|>
{question}
<|end_of_query|>
## Response A
<|begin_of_response_A|>
{prediction}
<|end_of_response_A|>
## Response B
<|begin_of_response_B|>
{prediction2}
<|end_of_response_B|>
# Evaluation
## Checklist
<|begin_of_checklist|>
{checklist}
<|end_of_checklist|>
Please use this checklist to guide your evaluation, but do not limit your \
assessment to the checklist.
## Rules
You should compare the above two responses based on your analysis of the \
user queries and the conversation history.
You should first write down your analysis and the checklist that you used \
for the evaluation, and then provide your assessment according to the \
checklist.
There are five choices to give your final assessment: ["A++", "A+", \
"A=B", "B+", "B++"], which correspond to the following meanings:
- `A++`: Response A is much better than Response B.
- `A+`: Response A is only slightly better than Response B.
- `A=B`: Response A and B are of the same quality. Please use this \
choice sparingly.
- `B+`: Response B is only slightly better than Response A.
- `B++`: Response B is much better than Response A.
## Output Format
First, please output your analysis for each model response, and \
then summarize your assessment to three aspects: "reason A=B", \
"reason A>B", and "reason B>A", and finally make your choice for \
the final assessment.
Please provide your evaluation results in the following json \
format by filling in the placeholders in []:
{
"analysis of A": "[analysis of Response A]",
"analysis of B": "[analysis of Response B]",
"reason of A=B": "[where Response A and B perform equally well]",
"reason of A>B": "[where Response A is better than Response B]",
"reason of B>A": "[where Response B is better than Response A]",
"choice": "[A++ or A+ or A=B or B+ or B++]",
}
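The template above mixes named placeholders (`{question}`, `{prediction}`, `{prediction2}`, `{checklist}`) with the literal braces of the JSON skeleton, so plain `str.format` would treat that skeleton as a replacement field. The following is a minimal sketch of one way to fill only the named placeholders; the helper `fill_prompt` and the shortened template are hypothetical illustrations, not OpenCompass code.

```python
import re


def fill_prompt(template: str, **fields: str) -> str:
    """Replace only the {name} placeholders passed in; leave all other braces alone.

    Hypothetical helper for illustration, not part of OpenCompass.
    """
    if not fields:
        return template
    # One alternation over the known field names, e.g. \{(question|prediction)\}
    pattern = re.compile(r"\{(" + "|".join(map(re.escape, fields)) + r")\}")
    # A single pass: text inserted for one field is never scanned again,
    # so a response that happens to contain "{question}" stays untouched.
    return pattern.sub(lambda m: fields[m.group(1)], template)


# Shortened stand-in for the full prompt shown above.
template = (
    "## User Query\n{question}\n"
    "## Response A\n{prediction}\n"
    "## Response B\n{prediction2}\n"
    '{\n"choice": "[A++ or A+ or A=B or B+ or B++]",\n}'
)

prompt = fill_prompt(
    template,
    question="What is 2 + 2?",
    prediction="4",
    prediction2="5",
)
print(prompt)
```

Substituting in a single regular-expression pass keeps the JSON skeleton intact and avoids re-expanding placeholders that appear inside a model's response.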
# 指令
您是一位专业评估专家。您的任务是评估两个AI模型生成回答的质量。
我们将为您提供用户问题及一对AI生成的回答(回答A和回答B)。
您应当首先仔细阅读用户问题,然后根据以下提供的规则评估回答的质量。
# 用户与AI之间的对话
## 用户问题
<|begin_of_query|>
{question}
<|end_of_query|>
## 回答A
<|begin_of_response_A|>
{prediction}
<|end_of_response_A|>
## 回答B
<|begin_of_response_B|>
{prediction2}
<|end_of_response_B|>
# 评估
## 检查清单
<|begin_of_checklist|>
{checklist}
<|end_of_checklist|>
请参考此检查清单来评估回答的质量,但不要局限于此检查清单。
## 规则
您应当基于用户查询,分析比较上述两种回答。
您应当基于检查清单写下您的分析,然后提供您的评价。
有五个选项供您做出最终评估:["A++", "A+", "A=B", "B+", "B++"],它们对应如下含义:
- `A++`:回答A远胜于回答B。
- `A+`:回答A略优于回答B。
- `A=B`:回答A和回答B质量相同。请谨慎使用此选项。
- `B+`:回答B略优于回答A。
- `B++`:回答B远胜于回答A。
## 输出格式
首先,请输出您对每个模型回答的分析,
然后总结您的评估到三个方面:"A=B的理由","A优于B的理由",和 "B优于A的理由",
最后做出您对最终评估的选择。
请按照以下json格式提供您的评估结果,通过填充[]中的占位符:
{
"回答A的分析": "[回答A的分析]",
"回答B的分析": "[回答B的分析]",
"A=B的理由": "[A和B回答差不多的理由]",
"A优于B的理由": "[回答A优于B的理由]",
"B优于A的理由": "[回答B优于A的理由]",
"choice": "[A++ or A+ or A=B or B+ or B++]",
}
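Both prompt versions ask the judge to report its verdict under the `"choice"` key, and the requested format ends with a trailing comma, so a strict JSON parser may reject the reply. Below is a minimal sketch of extracting the verdict with a regular expression and turning a list of verdicts into a win rate. The function names, the choice of which side (`candidate`) is the model under test, counting `A=B` as half a win, and skipping unparseable replies are all assumptions for illustration; they are not the leaderboard's documented scoring.

```python
import re
from typing import Iterable, Optional

# Longer labels come first so "A++" is never read as "A+".
# Optional square brackets cover judges that echo the placeholder brackets.
_CHOICE_RE = re.compile(
    r'"choice"\s*:\s*"\s*\[?\s*(A\+\+|A\+|A=B|B\+\+|B\+)\s*\]?\s*"'
)


def extract_choice(judge_reply: str) -> Optional[str]:
    """Return one of the five grades, or None if no verdict can be found."""
    match = _CHOICE_RE.search(judge_reply)
    return match.group(1) if match else None


def win_rate(choices: Iterable[Optional[str]], candidate: str = "A") -> Optional[float]:
    """Win rate of `candidate` ("A" or "B") over the parseable verdicts.

    Assumed scoring: a win counts 1, a tie (A=B) counts 0.5, a loss counts 0.
    Replies without a parseable verdict (None) are skipped.
    """
    other = "B" if candidate == "A" else "A"
    score = {
        f"{candidate}++": 1.0,
        f"{candidate}+": 1.0,
        "A=B": 0.5,
        f"{other}+": 0.0,
        f"{other}++": 0.0,
    }
    valid = [c for c in choices if c in score]
    if not valid:
        return None
    return sum(score[c] for c in valid) / len(valid)


replies = [
    '{"analysis of A": "clear", "choice": "A++",\n}',
    '{"回答A的分析": "较完整", "choice": "[A+]",\n}',
    '{"choice": "A=B"}',
    '{"choice": "B+"}',
    "The judge did not follow the format.",
]
verdicts = [extract_choice(r) for r in replies]
print(verdicts)
print(win_rate(verdicts, candidate="A"))
```

Keeping extraction separate from scoring makes it easy to swap in a different weighting of the five grades without touching the parsing step.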