[Doc] Update dataset list (#437)

* add new dataset list

* add new dataset list

* add new dataset list

* update

* update

* update readme

---------

Co-authored-by: gaotongxiao <gaotongxiao@gmail.com>
This commit is contained in:
Leymore 2023-09-27 15:02:09 +08:00 committed by GitHub
parent dc1b82c346
commit d6261e109d
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 628 additions and 486 deletions

558
README.md
View File

@ -34,9 +34,10 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2023.09.26\]** We update the leaderboard with [Qwen](https://github.com/QwenLM/Qwen), one of the best-performing open-source models currently available, welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥.
- **\[2023.09.20\]** We update the leaderboard with [InternLM-20B](https://github.com/InternLM/InternLM), welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥.
- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥.
- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md). 🔥🔥🔥.
- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md).
- **\[2023.09.08\]** We update the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.06\]** [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) team adpots OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- **\[2023.09.02\]** We have supported the evaluation of [Qwen-VL](https://github.com/QwenLM/Qwen-VL) in OpenCompass.
@ -51,7 +52,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features includes:
- **Comprehensive support for models and datasets**: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 50+ datasets with about 300,000 questions, comprehensively evaluating the capabilities of the models in five dimensions.
- **Comprehensive support for models and datasets**: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 70+ datasets with about 400,000 questions, comprehensively evaluating the capabilities of the models in five dimensions.
- **Efficient distributed evaluation**: One line command to implement task division and distributed evaluation, completing the full evaluation of billion-scale models in just a few hours.
@ -67,247 +68,6 @@ We provide [OpenCompass Leaderbaord](https://opencompass.org.cn/rank) for commun
<p align="right"><a href="#top">🔝Back to top</a></p>
## 📖 Dataset Support
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Language</b>
</td>
<td>
<b>Knowledge</b>
</td>
<td>
<b>Reasoning</b>
</td>
<td>
<b>Comprehensive Examination</b>
</td>
<td>
<b>Understanding</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Word Definition</b></summary>
- WiC
- SummEdits
</details>
<details open>
<summary><b>Idiom Learning</b></summary>
- CHID
</details>
<details open>
<summary><b>Semantic Similarity</b></summary>
- AFQMC
- BUSTM
</details>
<details open>
<summary><b>Coreference Resolution</b></summary>
- CLUEWSC
- WSC
- WinoGrande
</details>
<details open>
<summary><b>Translation</b></summary>
- Flores
</details>
</td>
<td>
<details open>
<summary><b>Knowledge Question Answering</b></summary>
- BoolQ
- CommonSenseQA
- NaturalQuestion
- TrivialQA
</details>
<details open>
<summary><b>Multi-language Question Answering</b></summary>
- TyDi-QA
</details>
</td>
<td>
<details open>
<summary><b>Textual Entailment</b></summary>
- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE
</details>
<details open>
<summary><b>Commonsense Reasoning</b></summary>
- StoryCloze
- StoryCloze-CN (coming soon)
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA
</details>
<details open>
<summary><b>Mathematical Reasoning</b></summary>
- MATH
- GSM8K
</details>
<details open>
<summary><b>Theorem Application</b></summary>
- TheoremQA
</details>
<details open>
<summary><b>Code</b></summary>
- HumanEval
- MBPP
</details>
<details open>
<summary><b>Comprehensive Reasoning</b></summary>
- BBH
</details>
</td>
<td>
<details open>
<summary><b>Junior High, High School, University, Professional Examinations</b></summary>
- GAOKAO-2023
- CEval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
</details>
</td>
<td>
<details open>
<summary><b>Reading Comprehension</b></summary>
- C3
- CMRC
- DRCD
- MultiRC
- RACE
</details>
<details open>
<summary><b>Content Summary</b></summary>
- CSL
- LCSTS
- XSum
</details>
<details open>
<summary><b>Content Analysis</b></summary>
- EPRSTMT
- LAMBADA
- TNEWS
</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>
<p align="right"><a href="#top">🔝Back to top</a></p>
## 📖 Model Support
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Open-source Models</b>
</td>
<td>
<b>API Models</b>
</td>
<!-- <td>
<b>Custom Models</b>
</td> -->
</tr>
<tr valign="top">
<td>
- InternLM
- LLaMA
- Vicuna
- Alpaca
- Baichuan
- WizardLM
- ChatGLM-6B
- ChatGLM2-6B
- MPT
- Falcon
- TigerBot
- MOSS
- ...
</td>
<td>
- OpenAI
- Claude (coming soon)
- PaLM (coming soon)
- ……
</td>
<!--
- GLM
- ...
</td> -->
</tr>
</tbody>
</table>
## 🛠️ Installation
Below are the steps for quick installation and datasets preparation.
@ -360,6 +120,316 @@ python run.py --datasets ceval_ppl mmlu_ppl \
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started.html) to learn how to run an evaluation task.
<p align="right"><a href="#top">🔝Back to top</a></p>
## 📖 Dataset Support
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Language</b>
</td>
<td>
<b>Knowledge</b>
</td>
<td>
<b>Reasoning</b>
</td>
<td>
<b>Examination</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Word Definition</b></summary>
- WiC
- SummEdits
</details>
<details open>
<summary><b>Idiom Learning</b></summary>
- CHID
</details>
<details open>
<summary><b>Semantic Similarity</b></summary>
- AFQMC
- BUSTM
</details>
<details open>
<summary><b>Coreference Resolution</b></summary>
- CLUEWSC
- WSC
- WinoGrande
</details>
<details open>
<summary><b>Translation</b></summary>
- Flores
- IWSLT2017
</details>
<details open>
<summary><b>Multi-language Question Answering</b></summary>
- TyDi-QA
- XCOPA
</details>
<details open>
<summary><b>Multi-language Summary</b></summary>
- XLSum
</details>
</td>
<td>
<details open>
<summary><b>Knowledge Question Answering</b></summary>
- BoolQ
- CommonSenseQA
- NaturalQuestions
- TriviaQA
</details>
</td>
<td>
<details open>
<summary><b>Textual Entailment</b></summary>
- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE
- ANLI
</details>
<details open>
<summary><b>Commonsense Reasoning</b></summary>
- StoryCloze
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA
</details>
<details open>
<summary><b>Mathematical Reasoning</b></summary>
- MATH
- GSM8K
</details>
<details open>
<summary><b>Theorem Application</b></summary>
- TheoremQA
- StrategyQA
- SciBench
</details>
<details open>
<summary><b>Comprehensive Reasoning</b></summary>
- BBH
</details>
</td>
<td>
<details open>
<summary><b>Junior High, High School, University, Professional Examinations</b></summary>
- C-Eval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
- Xiezhi
</details>
<details open>
<summary><b>Medical Examinations</b></summary>
- CMB
</details>
</td>
</tr>
</td>
</tr>
</tbody>
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Understanding</b>
</td>
<td>
<b>Long Context</b>
</td>
<td>
<b>Safety</b>
</td>
<td>
<b>Code</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Reading Comprehension</b></summary>
- C3
- CMRC
- DRCD
- MultiRC
- RACE
- DROP
- OpenBookQA
- SQuAD2.0
</details>
<details open>
<summary><b>Content Summary</b></summary>
- CSL
- LCSTS
- XSum
- SummScreen
</details>
<details open>
<summary><b>Content Analysis</b></summary>
- EPRSTMT
- LAMBADA
- TNEWS
</details>
</td>
<td>
<details open>
<summary><b>Long Context Understanding</b></summary>
- LEval
- LongBench
- GovReports
- NarrativeQA
- Qasper
</details>
</td>
<td>
<details open>
<summary><b>Safety</b></summary>
- CivilComments
- CrowsPairs
- CValues
- JigsawMultilingual
- TruthfulQA
</details>
<details open>
<summary><b>Robustness</b></summary>
- AdvGLUE
</details>
</td>
<td>
<details open>
<summary><b>Code</b></summary>
- HumanEval
- HumanEvalX
- MBPP
- APPs
- DS1000
</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>
<p align="right"><a href="#top">🔝Back to top</a></p>
## 📖 Model Support
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Open-source Models</b>
</td>
<td>
<b>API Models</b>
</td>
<!-- <td>
<b>Custom Models</b>
</td> -->
</tr>
<tr valign="top">
<td>
- InternLM
- LLaMA
- Vicuna
- Alpaca
- Baichuan
- WizardLM
- ChatGLM2
- Falcon
- TigerBot
- Qwen
- ...
</td>
<td>
- OpenAI
- Claude
- PaLM (coming soon)
- ……
</td>
</tr>
</tbody>
</table>
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🔜 Roadmap
- [ ] Subjective Evaluation

View File

@ -34,9 +34,10 @@
## 🚀 最新进展 <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2023.09.26\]** 我们在评测榜单上更新了[Qwen](https://github.com/QwenLM/Qwen), 这是目前表现最好的开源模型之一, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥.
- **\[2023.09.20\]** 我们在评测榜单上更新了[InternLM-20B](https://github.com/InternLM/InternLM), 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥.
- **\[2023.09.19\]** 我们在评测榜单上更新了WeMix-LLaMA2-70B/Phi-1.5-1.3B, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥.
- **\[2023.09.18\]** 我们发布了[长文本评测指引](docs/zh_cn/advanced_guides/longeval.md).🔥🔥🔥.
- **\[2023.09.19\]** 我们在评测榜单上更新了WeMix-LLaMA2-70B/Phi-1.5-1.3B, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.
- **\[2023.09.18\]** 我们发布了[长文本评测指引](docs/zh_cn/advanced_guides/longeval.md).
- **\[2023.09.08\]** 我们在评测榜单上更新了Baichuan-2/Tigerbot-2/Vicuna-v1.5, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情。
- **\[2023.09.06\]** 欢迎 [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) 团队采用OpenCompass对模型进行系统评估。我们非常感谢社区在提升LLM评估的透明度和可复现性上所做的努力。
- **\[2023.09.02\]** 我们加入了[Qwen-VL](https://github.com/QwenLM/Qwen-VL)的评测支持。
@ -53,7 +54,7 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下
- **开源可复现**:提供公平、公开、可复现的大模型评测方案
- **全面的能力维度**:五大维度设计,提供 50+ 个数据集约 30 万题的的模型评测方案,全面评估模型能力
- **全面的能力维度**:五大维度设计,提供 70+ 个数据集约 40 万题的的模型评测方案,全面评估模型能力
- **丰富的模型支持**:已支持 20+ HuggingFace 及 API 模型
@ -69,245 +70,6 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 📖 数据集支持
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>语言</b>
</td>
<td>
<b>知识</b>
</td>
<td>
<b>推理</b>
</td>
<td>
<b>学科</b>
</td>
<td>
<b>理解</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>字词释义</b></summary>
- WiC
- SummEdits
</details>
<details open>
<summary><b>成语习语</b></summary>
- CHID
</details>
<details open>
<summary><b>语义相似度</b></summary>
- AFQMC
- BUSTM
</details>
<details open>
<summary><b>指代消解</b></summary>
- CLUEWSC
- WSC
- WinoGrande
</details>
<details open>
<summary><b>翻译</b></summary>
- Flores
</details>
</td>
<td>
<details open>
<summary><b>知识问答</b></summary>
- BoolQ
- CommonSenseQA
- NaturalQuestion
- TrivialQA
</details>
<details open>
<summary><b>多语种问答</b></summary>
- TyDi-QA
</details>
</td>
<td>
<details open>
<summary><b>文本蕴含</b></summary>
- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE
</details>
<details open>
<summary><b>常识推理</b></summary>
- StoryCloze
- StoryCloze-CN即将上线
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA
</details>
<details open>
<summary><b>数学推理</b></summary>
- MATH
- GSM8K
</details>
<details open>
<summary><b>定理应用</b></summary>
- TheoremQA
</details>
<details open>
<summary><b>代码</b></summary>
- HumanEval
- MBPP
</details>
<details open>
<summary><b>综合推理</b></summary>
- BBH
</details>
</td>
<td>
<details open>
<summary><b>初中/高中/大学/职业考试</b></summary>
- GAOKAO-2023
- CEval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
</details>
</td>
<td>
<details open>
<summary><b>阅读理解</b></summary>
- C3
- CMRC
- DRCD
- MultiRC
- RACE
</details>
<details open>
<summary><b>内容总结</b></summary>
- CSL
- LCSTS
- XSum
</details>
<details open>
<summary><b>内容分析</b></summary>
- EPRSTMT
- LAMBADA
- TNEWS
</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 📖 模型支持
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>开源模型</b>
</td>
<td>
<b>API 模型</b>
</td>
<!-- <td>
<b>自定义模型</b>
</td> -->
</tr>
<tr valign="top">
<td>
- LLaMA
- Vicuna
- Alpaca
- Baichuan
- WizardLM
- ChatGLM-6B
- ChatGLM2-6B
- MPT
- Falcon
- TigerBot
- MOSS
- ……
</td>
<td>
- OpenAI
- Claude (即将推出)
- PaLM (即将推出)
- ……
</td>
<!-- <td>
- GLM
- ……
</td> -->
</tr>
</tbody>
</table>
## 🛠️ 安装
下面展示了快速安装以及准备数据集的步骤。
@ -362,6 +124,316 @@ python run.py --datasets ceval_ppl mmlu_ppl \
更多教程请查看我们的[文档](https://opencompass.readthedocs.io/zh_CN/latest/index.html)。
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 📖 数据集支持
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>语言</b>
</td>
<td>
<b>知识</b>
</td>
<td>
<b>推理</b>
</td>
<td>
<b>考试</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>字词释义</b></summary>
- WiC
- SummEdits
</details>
<details open>
<summary><b>成语习语</b></summary>
- CHID
</details>
<details open>
<summary><b>语义相似度</b></summary>
- AFQMC
- BUSTM
</details>
<details open>
<summary><b>指代消解</b></summary>
- CLUEWSC
- WSC
- WinoGrande
</details>
<details open>
<summary><b>翻译</b></summary>
- Flores
- IWSLT2017
</details>
<details open>
<summary><b>多语种问答</b></summary>
- TyDi-QA
- XCOPA
</details>
<details open>
<summary><b>多语种总结</b></summary>
- XLSum
</details>
</td>
<td>
<details open>
<summary><b>知识问答</b></summary>
- BoolQ
- CommonSenseQA
- NaturalQuestions
- TriviaQA
</details>
</td>
<td>
<details open>
<summary><b>文本蕴含</b></summary>
- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE
- ANLI
</details>
<details open>
<summary><b>常识推理</b></summary>
- StoryCloze
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA
</details>
<details open>
<summary><b>数学推理</b></summary>
- MATH
- GSM8K
</details>
<details open>
<summary><b>定理应用</b></summary>
- TheoremQA
- StrategyQA
- SciBench
</details>
<details open>
<summary><b>综合推理</b></summary>
- BBH
</details>
</td>
<td>
<details open>
<summary><b>初中/高中/大学/职业考试</b></summary>
- C-Eval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
- Xiezhi
</details>
<details open>
<summary><b>医学考试</b></summary>
- CMB
</details>
</td>
</tr>
</td>
</tr>
</tbody>
<tbody>
<tr align="center" valign="bottom">
<td>
<b>理解</b>
</td>
<td>
<b>长文本</b>
</td>
<td>
<b>安全</b>
</td>
<td>
<b>代码</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>阅读理解</b></summary>
- C3
- CMRC
- DRCD
- MultiRC
- RACE
- DROP
- OpenBookQA
- SQuAD2.0
</details>
<details open>
<summary><b>内容总结</b></summary>
- CSL
- LCSTS
- XSum
- SummScreen
</details>
<details open>
<summary><b>内容分析</b></summary>
- EPRSTMT
- LAMBADA
- TNEWS
</details>
</td>
<td>
<details open>
<summary><b>长文本理解</b></summary>
- LEval
- LongBench
- GovReports
- NarrativeQA
- Qasper
</details>
</td>
<td>
<details open>
<summary><b>安全</b></summary>
- CivilComments
- CrowsPairs
- CValues
- JigsawMultilingual
- TruthfulQA
</details>
<details open>
<summary><b>健壮性</b></summary>
- AdvGLUE
</details>
</td>
<td>
<details open>
<summary><b>代码</b></summary>
- HumanEval
- HumanEvalX
- MBPP
- APPs
- DS1000
</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 📖 模型支持
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>开源模型</b>
</td>
<td>
<b>API 模型</b>
</td>
<!-- <td>
<b>自定义模型</b>
</td> -->
</tr>
<tr valign="top">
<td>
- InternLM
- LLaMA
- Vicuna
- Alpaca
- Baichuan
- WizardLM
- ChatGLM2
- Falcon
- TigerBot
- Qwen
- ……
</td>
<td>
- OpenAI
- Claude
- PaLM (即将推出)
- ……
</td>
</tr>
</tbody>
</table>
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 🔜 路线图
- [ ] 主观评测