diff --git a/docs/en/index.rst b/docs/en/index.rst
index a92a09a8..b49c836c 100644
--- a/docs/en/index.rst
+++ b/docs/en/index.rst
@@ -35,6 +35,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
    user_guides/models.md
    user_guides/evaluation.md
    user_guides/experimentation.md
+   user_guides/metrics.md
 
 .. _AdvancedGuides:
 .. toctree::
diff --git a/docs/en/user_guides/metrics.md b/docs/en/user_guides/metrics.md
index 9eb8bc83..3fe85b2d 100644
--- a/docs/en/user_guides/metrics.md
+++ b/docs/en/user_guides/metrics.md
@@ -1 +1,62 @@
 # Metric Calculation
+
+In the evaluation phase, we typically choose the evaluation metric according to the characteristics of the dataset itself. The main criterion is the **type of the standard (reference) answer**, which generally falls into the following categories:
+
+- **Choice**: Common in classification tasks, true/false questions, and multiple-choice questions. Datasets of this type currently make up the largest share, with datasets such as MMLU, CEval, etc. Accuracy is usually the metric--`ACCEvaluator`.
+- **Phrase**: Common in Q&A and reading comprehension tasks. Datasets of this type mainly include CLUE_CMRC, CLUE_DRCD, DROP, etc. Exact-match rate is usually the metric--`EMEvaluator`.
+- **Sentence**: Common in translation and pseudocode/command-line generation tasks, mainly including the Flores, Summscreen, Govrepcrs, and Iwslt2017 datasets, etc. BLEU (Bilingual Evaluation Understudy) is usually the metric--`BleuEvaluator`.
+- **Paragraph**: Common in text summarization tasks; commonly used datasets include Lcsts, TruthfulQA, Xsum, etc. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is usually the metric--`RougeEvaluator`.
+- **Code**: Common in code generation tasks; commonly used datasets include Humaneval, MBPP, etc. Execution pass rate and `pass@k` are usually the metrics. At present, OpenCompass supports `MBPPEvaluator` and `HumanEvaluator`.
+
+There is also a class of **scoring-type** evaluation tasks without standard answers, such as judging whether a model's output is toxic; these can be scored directly by a dedicated API service. OpenCompass currently supports `ToxicEvaluator` for this purpose, and the realtoxicityprompts dataset uses this evaluation method.
+
+## Supported Evaluation Metrics
+
+Currently in OpenCompass, the commonly used Evaluators are mainly located in the [`opencompass/openicl/icl_evaluator`](https://github.com/InternLM/opencompass/tree/main/opencompass/openicl/icl_evaluator) folder, while some dataset-specific metrics live alongside their datasets in [`opencompass/datasets`](https://github.com/InternLM/opencompass/tree/main/opencompass/datasets).
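+
+For instance, here is a minimal sketch of how one of these evaluators pairs with a postprocessor in a dataset's evaluation config. This is hypothetical: the `mmlu_eval_cfg` name is illustrative, and the imports assume the class names in the table below are exported the same way `BleuEvaluator` and `general_cn_postprocess` are in the configuration example later on this page.
+
+```python
+from opencompass.openicl.icl_evaluator import ACCEvaluator
+from opencompass.utils.text_postprocessors import first_capital_postprocess
+
+# Hypothetical eval_cfg for a choice-type dataset such as MMLU:
+# ACCEvaluator with first_capital_postprocess is the pairing listed
+# for choice-type datasets in the table below.
+mmlu_eval_cfg = dict(
+    evaluator=dict(type=ACCEvaluator),                        # score by accuracy
+    pred_postprocessor=dict(type=first_capital_postprocess),  # keep only the first capital letter (the option label)
+)
+```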
+Below is a summary:
+
+| Evaluator          | Metric               | Common Postprocessing Method | Datasets                                                             |
+| ------------------ | -------------------- | ---------------------------- | -------------------------------------------------------------------- |
+| `ACCEvaluator`     | Accuracy             | `first_capital_postprocess`  | agieval, ARC, bbh, mmlu, ceval, commonsenseqa, crowspairs, hellaswag |
+| `EMEvaluator`      | Exact Match          | None, dataset-specific       | drop, CLUE_CMRC, CLUE_DRCD                                           |
+| `BleuEvaluator`    | BLEU                 | None, `flores`               | flores, iwslt2017, summscreen, govrepcrs                             |
+| `RougeEvaluator`   | ROUGE                | None, dataset-specific       | lcsts, truthfulqa, Xsum, XLSum                                       |
+| `HumanEvaluator`   | pass@k               | `humaneval_postprocess`      | humaneval                                                            |
+| `MBPPEvaluator`    | Execution Pass Rate  | None                         | mbpp                                                                 |
+| `ToxicEvaluator`   | PerspectiveAPI       | None                         | realtoxicityprompts                                                  |
+| `AGIEvalEvaluator` | Accuracy             | None                         | agieval                                                              |
+| `AUCROCEvaluator`  | AUC-ROC              | None                         | jigsawmultilingual, civilcomments                                    |
+| `MATHEvaluator`    | Accuracy             | `math_postprocess`           | math                                                                 |
+| `MccEvaluator`     | Matthews Correlation | None                         | --                                                                   |
+| `SquadEvaluator`   | F1 score             | None                         | --                                                                   |
+
+## How to Configure
+
+The metric configuration is generally placed in the dataset configuration file, and the final `xxdataset_eval_cfg` is passed to the dataset definition as its `eval_cfg` instantiation parameter.
+
+Below is the definition of `govrepcrs_eval_cfg`; see [configs/datasets/govrepcrs](https://github.com/InternLM/opencompass/tree/main/configs/datasets/govrepcrs) for the full file.
+
+```python
+from opencompass.openicl.icl_evaluator import BleuEvaluator
+from opencompass.datasets import GovRepcrsDataset
+from opencompass.utils.text_postprocessors import general_cn_postprocess
+
+govrepcrs_reader_cfg = dict(.......)
+govrepcrs_infer_cfg = dict(.......)
+
+# Configuration of the evaluation metric
+govrepcrs_eval_cfg = dict(
+    evaluator=dict(type=BleuEvaluator),                       # use BleuEvaluator, common for translation tasks
+    pred_role='BOT',                                          # evaluate the output of the 'BOT' role
+    pred_postprocessor=dict(type=general_cn_postprocess),     # postprocessing of prediction results
+    dataset_postprocessor=dict(type=general_cn_postprocess))  # postprocessing of the dataset's standard answers
+
+govrepcrs_datasets = [
+    dict(
+        type=GovRepcrsDataset,            # dataset class name
+        path='./data/govrep/',            # dataset path
+        abbr='GovRepcrs',                 # dataset alias
+        reader_cfg=govrepcrs_reader_cfg,  # dataset reading config: split, columns, etc.
+        infer_cfg=govrepcrs_infer_cfg,    # dataset inference config, mainly prompt-related
+        eval_cfg=govrepcrs_eval_cfg)      # dataset evaluation config: metric plus pre/postprocessing
+]
+```
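+
+The same pattern applies to the other metric types: usually only the evaluator (and postprocessor) changes. As a hypothetical minimal sketch, a summarization dataset such as Xsum would swap in `RougeEvaluator` from the table above (the `xsum_eval_cfg` name is illustrative):
+
+```python
+from opencompass.openicl.icl_evaluator import RougeEvaluator
+
+# Hypothetical eval_cfg for a paragraph-type (summarization) dataset:
+# only the evaluator type differs from the BLEU example above.
+xsum_eval_cfg = dict(
+    evaluator=dict(type=RougeEvaluator),  # ROUGE, common for summarization tasks
+    pred_role='BOT',                      # evaluate the output of the 'BOT' role
+)
+```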
diff --git a/docs/zh_cn/index.rst b/docs/zh_cn/index.rst
index 7164d929..746fd360 100644
--- a/docs/zh_cn/index.rst
+++ b/docs/zh_cn/index.rst
@@ -36,6 +36,7 @@ OpenCompass Getting Started Roadmap
    user_guides/models.md
    user_guides/evaluation.md
    user_guides/experimentation.md
+   user_guides/metrics.md
 
 .. _提示词:
 .. toctree::
diff --git a/docs/zh_cn/user_guides/metrics.md b/docs/zh_cn/user_guides/metrics.md
index 3c2f7e68..93f43b5e 100644
--- a/docs/zh_cn/user_guides/metrics.md
+++ b/docs/zh_cn/user_guides/metrics.md
@@ -1,3 +1,62 @@
 # Metric Calculation
 
-Coming soon.
+In the evaluation phase, we generally choose the evaluation strategy according to the characteristics of the dataset itself. The main criterion is the **type of the standard answer**, which generally falls into the following categories:
+
+- **Choice**: Common in classification tasks, true/false questions, and multiple-choice questions. Datasets of this type currently make up the largest share, with datasets such as MMLU, CEval, etc. Accuracy is usually the metric--`ACCEvaluator`.
+- **Phrase**: Common in Q&A and reading comprehension tasks. Datasets of this type mainly include CLUE_CMRC, CLUE_DRCD, DROP, etc. Exact-match rate is usually the metric--`EMEvaluator`.
+- **Sentence**: Common in translation and pseudocode/command-line generation tasks, mainly including the Flores, Summscreen, Govrepcrs, and Iwslt2017 datasets, etc. BLEU (Bilingual Evaluation Understudy) is usually the metric--`BleuEvaluator`.
+- **Paragraph**: Common in text summarization tasks; commonly used datasets include Lcsts, TruthfulQA, Xsum, etc. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is usually the metric--`RougeEvaluator`.
+- **Code**: Common in code generation tasks; commonly used datasets include Humaneval, MBPP, etc. Execution pass rate and `pass@k` are usually the metrics. At present, OpenCompass supports `MBPPEvaluator` and `HumanEvaluator`.
+
+There is also a class of **scoring-type** evaluation tasks without standard answers, such as judging whether a model's output is toxic; these can be scored directly by a dedicated API service. OpenCompass currently supports `ToxicEvaluator` for this purpose, and the realtoxicityprompts dataset uses this evaluation method.
+
+## Supported Evaluation Metrics
+
+Currently in OpenCompass, the commonly used Evaluators are mainly located in the [`opencompass/openicl/icl_evaluator`](https://github.com/InternLM/opencompass/tree/main/opencompass/openicl/icl_evaluator) folder, while some dataset-specific metrics live alongside their datasets in [`opencompass/datasets`](https://github.com/InternLM/opencompass/tree/main/opencompass/datasets). Below is a summary:
+
+| Evaluator          | Metric               | Common Postprocessing Method | Datasets                                                             |
+| ------------------ | -------------------- | ---------------------------- | -------------------------------------------------------------------- |
+| `ACCEvaluator`     | Accuracy             | `first_capital_postprocess`  | agieval, ARC, bbh, mmlu, ceval, commonsenseqa, crowspairs, hellaswag |
+| `EMEvaluator`      | Exact Match          | None, dataset-specific       | drop, CLUE_CMRC, CLUE_DRCD                                           |
+| `BleuEvaluator`    | BLEU                 | None, `flores`               | flores, iwslt2017, summscreen, govrepcrs                             |
+| `RougeEvaluator`   | ROUGE                | None, dataset-specific       | lcsts, truthfulqa, Xsum, XLSum                                       |
+| `HumanEvaluator`   | pass@k               | `humaneval_postprocess`      | humaneval                                                            |
+| `MBPPEvaluator`    | Execution Pass Rate  | None                         | mbpp                                                                 |
+| `ToxicEvaluator`   | PerspectiveAPI       | None                         | realtoxicityprompts                                                  |
+| `AGIEvalEvaluator` | Accuracy             | None                         | agieval                                                              |
+| `AUCROCEvaluator`  | AUC-ROC              | None                         | jigsawmultilingual, civilcomments                                    |
+| `MATHEvaluator`    | Accuracy             | `math_postprocess`           | math                                                                 |
+| `MccEvaluator`     | Matthews Correlation | None                         | --                                                                   |
+| `SquadEvaluator`   | F1 score             | None                         | --                                                                   |
+
+## How to Configure
+
+The metric configuration is generally placed in the dataset configuration file, and the final `xxdataset_eval_cfg` is passed to the dataset definition as its `eval_cfg` instantiation parameter.
+
+Below is the definition of `govrepcrs_eval_cfg`; see [configs/datasets/govrepcrs](https://github.com/InternLM/opencompass/tree/main/configs/datasets/govrepcrs) for details.
+
+```python
+from opencompass.openicl.icl_evaluator import BleuEvaluator
+from opencompass.datasets import GovRepcrsDataset
+from opencompass.utils.text_postprocessors import general_cn_postprocess
+
+govrepcrs_reader_cfg = dict(.......)
+govrepcrs_infer_cfg = dict(.......)
+
+# Configuration of the evaluation metric
+govrepcrs_eval_cfg = dict(
+    evaluator=dict(type=BleuEvaluator),                       # use BleuEvaluator, common for translation tasks
+    pred_role='BOT',                                          # evaluate the output of the 'BOT' role
+    pred_postprocessor=dict(type=general_cn_postprocess),     # postprocessing of prediction results
+    dataset_postprocessor=dict(type=general_cn_postprocess))  # postprocessing of the dataset's standard answers
+
+govrepcrs_datasets = [
+    dict(
+        type=GovRepcrsDataset,            # dataset class name
+        path='./data/govrep/',            # dataset path
+        abbr='GovRepcrs',                 # dataset alias
+        reader_cfg=govrepcrs_reader_cfg,  # dataset reading config: split, columns, etc.
+        infer_cfg=govrepcrs_infer_cfg,    # dataset inference config, mainly prompt-related
+        eval_cfg=govrepcrs_eval_cfg)      # dataset evaluation config: metric plus pre/postprocessing
+]
+```
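+
+As a complement to the BLEU example above, here is a hypothetical minimal sketch for a phrase-type dataset such as CLUE_CMRC, pairing it with `EMEvaluator` as listed in the table (the `cmrc_eval_cfg` name is illustrative):
+
+```python
+from opencompass.openicl.icl_evaluator import EMEvaluator
+
+# Hypothetical eval_cfg for a reading-comprehension dataset:
+# EMEvaluator scores predictions by exact match against the reference.
+cmrc_eval_cfg = dict(
+    evaluator=dict(type=EMEvaluator),  # exact-match scoring
+    pred_role='BOT',                   # evaluate the output of the 'BOT' role
+)
+```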