mirror of https://github.com/open-compass/opencompass.git (synced 2025-05-30 16:03:24 +08:00)

[Doc] Update dataset list (#437)

* add new dataset list
* add new dataset list
* add new dataset list
* update
* update
* update readme

Co-authored-by: gaotongxiao <gaotongxiao@gmail.com>

parent dc1b82c346
commit d6261e109d

README.md (558 lines changed)
@@ -34,9 +34,10 @@ Just like a compass guides us on our journey, OpenCompass will guide you through

## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2023.09.26\]** We update the leaderboard with [Qwen](https://github.com/QwenLM/Qwen), one of the best-performing open-source models currently available, welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥.
- **\[2023.09.20\]** We update the leaderboard with [InternLM-20B](https://github.com/InternLM/InternLM), welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥.
- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥.
- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md). 🔥🔥🔥.
- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md).
- **\[2023.09.08\]** We update the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.06\]** [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
- **\[2023.09.02\]** We have supported the evaluation of [Qwen-VL](https://github.com/QwenLM/Qwen-VL) in OpenCompass.
@@ -51,7 +52,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through

OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features include:

- **Comprehensive support for models and datasets**: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 50+ datasets with about 300,000 questions, comprehensively evaluating the capabilities of the models in five dimensions.
- **Comprehensive support for models and datasets**: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 70+ datasets with about 400,000 questions, comprehensively evaluating the capabilities of the models in five dimensions.

- **Efficient distributed evaluation**: One line command to implement task division and distributed evaluation, completing the full evaluation of billion-scale models in just a few hours.
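To make the "one line command" claim above concrete, here is a minimal, non-authoritative sketch of such a run. Only `--datasets ceval_ppl mmlu_ppl` appears later in this diff; the `--hf-path` flag, the model path, and `--max-num-workers` are assumptions about typical `run.py` usage and may differ across OpenCompass versions.

```bash
# Hedged sketch of splitting one evaluation into parallel tasks.
# Only "--datasets ceval_ppl mmlu_ppl" is taken from this diff;
# the remaining flags are assumptions -- check `python run.py -h`.
python run.py --datasets ceval_ppl mmlu_ppl \
              --hf-path huggyllama/llama-7b \
              --max-num-workers 8
```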
@@ -67,247 +68,6 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for commun
<p align="right"><a href="#top">🔝Back to top</a></p>

## 📖 Dataset Support

<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Language</b>
</td>
<td>
<b>Knowledge</b>
</td>
<td>
<b>Reasoning</b>
</td>
<td>
<b>Comprehensive Examination</b>
</td>
<td>
<b>Understanding</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Word Definition</b></summary>

- WiC
- SummEdits

</details>

<details open>
<summary><b>Idiom Learning</b></summary>

- CHID

</details>

<details open>
<summary><b>Semantic Similarity</b></summary>

- AFQMC
- BUSTM

</details>

<details open>
<summary><b>Coreference Resolution</b></summary>

- CLUEWSC
- WSC
- WinoGrande

</details>

<details open>
<summary><b>Translation</b></summary>

- Flores

</details>
</td>
<td>
<details open>
<summary><b>Knowledge Question Answering</b></summary>

- BoolQ
- CommonSenseQA
- NaturalQuestion
- TrivialQA

</details>

<details open>
<summary><b>Multi-language Question Answering</b></summary>

- TyDi-QA

</details>
</td>
<td>
<details open>
<summary><b>Textual Entailment</b></summary>

- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE

</details>

<details open>
<summary><b>Commonsense Reasoning</b></summary>

- StoryCloze
- StoryCloze-CN (coming soon)
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA

</details>

<details open>
<summary><b>Mathematical Reasoning</b></summary>

- MATH
- GSM8K

</details>

<details open>
<summary><b>Theorem Application</b></summary>

- TheoremQA

</details>

<details open>
<summary><b>Code</b></summary>

- HumanEval
- MBPP

</details>

<details open>
<summary><b>Comprehensive Reasoning</b></summary>

- BBH

</details>
</td>
<td>
<details open>
<summary><b>Junior High, High School, University, Professional Examinations</b></summary>

- GAOKAO-2023
- CEval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC

</details>
</td>
<td>
<details open>
<summary><b>Reading Comprehension</b></summary>

- C3
- CMRC
- DRCD
- MultiRC
- RACE

</details>

<details open>
<summary><b>Content Summary</b></summary>

- CSL
- LCSTS
- XSum

</details>

<details open>
<summary><b>Content Analysis</b></summary>

- EPRSTMT
- LAMBADA
- TNEWS

</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>

<p align="right"><a href="#top">🔝Back to top</a></p>

## 📖 Model Support

<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Open-source Models</b>
</td>
<td>
<b>API Models</b>
</td>
<!-- <td>
<b>Custom Models</b>
</td> -->
</tr>
<tr valign="top">
<td>

- InternLM
- LLaMA
- Vicuna
- Alpaca
- Baichuan
- WizardLM
- ChatGLM-6B
- ChatGLM2-6B
- MPT
- Falcon
- TigerBot
- MOSS
- ...

</td>
<td>

- OpenAI
- Claude (coming soon)
- PaLM (coming soon)
- ……

</td>

<!--
- GLM
- ...

</td> -->

</tr>
</tbody>
</table>

## 🛠️ Installation

Below are the steps for quick installation and datasets preparation.
@@ -360,6 +120,316 @@ python run.py --datasets ceval_ppl mmlu_ppl \

Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started.html) to learn how to run an evaluation task.
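To make the two launch modes mentioned in that paragraph concrete, here is a minimal sketch; `facebook/opt-125m`, `configs/eval_demo.py`, `--hf-path`, and `-w outputs/demo` are illustrative assumptions rather than values taken from this commit, and should be checked against the Quick Start for your installed version.

```bash
# Hedged examples of launching an evaluation; paths and flags are illustrative.

# 1) Pure command line, reusing the dataset configs shown in this diff:
python run.py --datasets ceval_ppl mmlu_ppl --hf-path facebook/opt-125m

# 2) Via a configuration file (hypothetical file name and work directory):
python run.py configs/eval_demo.py -w outputs/demo
```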
<p align="right"><a href="#top">🔝Back to top</a></p>

## 📖 Dataset Support

<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Language</b>
</td>
<td>
<b>Knowledge</b>
</td>
<td>
<b>Reasoning</b>
</td>
<td>
<b>Examination</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Word Definition</b></summary>

- WiC
- SummEdits

</details>

<details open>
<summary><b>Idiom Learning</b></summary>

- CHID

</details>

<details open>
<summary><b>Semantic Similarity</b></summary>

- AFQMC
- BUSTM

</details>

<details open>
<summary><b>Coreference Resolution</b></summary>

- CLUEWSC
- WSC
- WinoGrande

</details>

<details open>
<summary><b>Translation</b></summary>

- Flores
- IWSLT2017

</details>

<details open>
<summary><b>Multi-language Question Answering</b></summary>

- TyDi-QA
- XCOPA

</details>

<details open>
<summary><b>Multi-language Summary</b></summary>

- XLSum

</details>
</td>
<td>
<details open>
<summary><b>Knowledge Question Answering</b></summary>

- BoolQ
- CommonSenseQA
- NaturalQuestions
- TriviaQA

</details>
</td>
<td>
<details open>
<summary><b>Textual Entailment</b></summary>

- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE
- ANLI

</details>

<details open>
<summary><b>Commonsense Reasoning</b></summary>

- StoryCloze
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA

</details>

<details open>
<summary><b>Mathematical Reasoning</b></summary>

- MATH
- GSM8K

</details>

<details open>
<summary><b>Theorem Application</b></summary>

- TheoremQA
- StrategyQA
- SciBench

</details>

<details open>
<summary><b>Comprehensive Reasoning</b></summary>

- BBH

</details>
</td>
<td>
<details open>
<summary><b>Junior High, High School, University, Professional Examinations</b></summary>

- C-Eval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
- Xiezhi

</details>

<details open>
<summary><b>Medical Examinations</b></summary>

- CMB

</details>
</td>
</tr>
</td>
</tr>
</tbody>
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Understanding</b>
</td>
<td>
<b>Long Context</b>
</td>
<td>
<b>Safety</b>
</td>
<td>
<b>Code</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Reading Comprehension</b></summary>

- C3
- CMRC
- DRCD
- MultiRC
- RACE
- DROP
- OpenBookQA
- SQuAD2.0

</details>

<details open>
<summary><b>Content Summary</b></summary>

- CSL
- LCSTS
- XSum
- SummScreen

</details>

<details open>
<summary><b>Content Analysis</b></summary>

- EPRSTMT
- LAMBADA
- TNEWS

</details>
</td>
<td>
<details open>
<summary><b>Long Context Understanding</b></summary>

- LEval
- LongBench
- GovReports
- NarrativeQA
- Qasper

</details>
</td>
<td>
<details open>
<summary><b>Safety</b></summary>

- CivilComments
- CrowsPairs
- CValues
- JigsawMultilingual
- TruthfulQA

</details>
<details open>
<summary><b>Robustness</b></summary>

- AdvGLUE

</details>
</td>
<td>
<details open>
<summary><b>Code</b></summary>

- HumanEval
- HumanEvalX
- MBPP
- APPs
- DS1000

</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>

<p align="right"><a href="#top">🔝Back to top</a></p>

## 📖 Model Support

<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Open-source Models</b>
</td>
<td>
<b>API Models</b>
</td>
<!-- <td>
<b>Custom Models</b>
</td> -->
</tr>
<tr valign="top">
<td>

- InternLM
- LLaMA
- Vicuna
- Alpaca
- Baichuan
- WizardLM
- ChatGLM2
- Falcon
- TigerBot
- Qwen
- ...

</td>
<td>

- OpenAI
- Claude
- PaLM (coming soon)
- ……

</td>

</tr>
</tbody>
</table>

<p align="right"><a href="#top">🔝Back to top</a></p>

## 🔜 Roadmap

- [ ] Subjective Evaluation
README_zh-CN.md (556 lines changed)
@@ -34,9 +34,10 @@

## 🚀 最新进展 <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2023.09.26\]** 我们在评测榜单上更新了[Qwen](https://github.com/QwenLM/Qwen), 这是目前表现最好的开源模型之一, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥.
- **\[2023.09.20\]** 我们在评测榜单上更新了[InternLM-20B](https://github.com/InternLM/InternLM), 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥.
- **\[2023.09.19\]** 我们在评测榜单上更新了WeMix-LLaMA2-70B/Phi-1.5-1.3B, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥.
- **\[2023.09.18\]** 我们发布了[长文本评测指引](docs/zh_cn/advanced_guides/longeval.md).🔥🔥🔥.
- **\[2023.09.19\]** 我们在评测榜单上更新了WeMix-LLaMA2-70B/Phi-1.5-1.3B, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.
- **\[2023.09.18\]** 我们发布了[长文本评测指引](docs/zh_cn/advanced_guides/longeval.md).
- **\[2023.09.08\]** 我们在评测榜单上更新了Baichuan-2/Tigerbot-2/Vicuna-v1.5, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情。
- **\[2023.09.06\]** 欢迎 [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) 团队采用OpenCompass对模型进行系统评估。我们非常感谢社区在提升LLM评估的透明度和可复现性上所做的努力。
- **\[2023.09.02\]** 我们加入了[Qwen-VL](https://github.com/QwenLM/Qwen-VL)的评测支持。
@@ -53,7 +54,7 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下

- **开源可复现**：提供公平、公开、可复现的大模型评测方案

- **全面的能力维度**：五大维度设计，提供 50+ 个数据集约 30 万题的模型评测方案，全面评估模型能力
- **全面的能力维度**：五大维度设计，提供 70+ 个数据集约 40 万题的模型评测方案，全面评估模型能力

- **丰富的模型支持**：已支持 20+ HuggingFace 及 API 模型
@@ -69,245 +70,6 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下

<p align="right"><a href="#top">🔝返回顶部</a></p>

## 📖 数据集支持

<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>语言</b>
</td>
<td>
<b>知识</b>
</td>
<td>
<b>推理</b>
</td>
<td>
<b>学科</b>
</td>
<td>
<b>理解</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>字词释义</b></summary>

- WiC
- SummEdits

</details>

<details open>
<summary><b>成语习语</b></summary>

- CHID

</details>

<details open>
<summary><b>语义相似度</b></summary>

- AFQMC
- BUSTM

</details>

<details open>
<summary><b>指代消解</b></summary>

- CLUEWSC
- WSC
- WinoGrande

</details>

<details open>
<summary><b>翻译</b></summary>

- Flores

</details>
</td>
<td>
<details open>
<summary><b>知识问答</b></summary>

- BoolQ
- CommonSenseQA
- NaturalQuestion
- TrivialQA

</details>

<details open>
<summary><b>多语种问答</b></summary>

- TyDi-QA

</details>
</td>
<td>
<details open>
<summary><b>文本蕴含</b></summary>

- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE

</details>

<details open>
<summary><b>常识推理</b></summary>

- StoryCloze
- StoryCloze-CN(即将上线)
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA

</details>

<details open>
<summary><b>数学推理</b></summary>

- MATH
- GSM8K

</details>

<details open>
<summary><b>定理应用</b></summary>

- TheoremQA

</details>

<details open>
<summary><b>代码</b></summary>

- HumanEval
- MBPP

</details>

<details open>
<summary><b>综合推理</b></summary>

- BBH

</details>
</td>
<td>
<details open>
<summary><b>初中/高中/大学/职业考试</b></summary>

- GAOKAO-2023
- CEval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC

</details>
</td>
<td>
<details open>
<summary><b>阅读理解</b></summary>

- C3
- CMRC
- DRCD
- MultiRC
- RACE

</details>

<details open>
<summary><b>内容总结</b></summary>

- CSL
- LCSTS
- XSum

</details>

<details open>
<summary><b>内容分析</b></summary>

- EPRSTMT
- LAMBADA
- TNEWS

</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>

<p align="right"><a href="#top">🔝返回顶部</a></p>

## 📖 模型支持

<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>开源模型</b>
</td>
<td>
<b>API 模型</b>
</td>
<!-- <td>
<b>自定义模型</b>
</td> -->
</tr>
<tr valign="top">
<td>

- LLaMA
- Vicuna
- Alpaca
- Baichuan
- WizardLM
- ChatGLM-6B
- ChatGLM2-6B
- MPT
- Falcon
- TigerBot
- MOSS
- ……

</td>
<td>

- OpenAI
- Claude (即将推出)
- PaLM (即将推出)
- ……

</td>
<!-- <td>

- GLM
- ……

</td> -->
</tr>
</tbody>
</table>

## 🛠️ 安装

下面展示了快速安装以及准备数据集的步骤。
@@ -362,6 +124,316 @@ python run.py --datasets ceval_ppl mmlu_ppl \

更多教程请查看我们的[文档](https://opencompass.readthedocs.io/zh_CN/latest/index.html)。
<p align="right"><a href="#top">🔝返回顶部</a></p>

## 📖 数据集支持

<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>语言</b>
</td>
<td>
<b>知识</b>
</td>
<td>
<b>推理</b>
</td>
<td>
<b>考试</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>字词释义</b></summary>

- WiC
- SummEdits

</details>

<details open>
<summary><b>成语习语</b></summary>

- CHID

</details>

<details open>
<summary><b>语义相似度</b></summary>

- AFQMC
- BUSTM

</details>

<details open>
<summary><b>指代消解</b></summary>

- CLUEWSC
- WSC
- WinoGrande

</details>

<details open>
<summary><b>翻译</b></summary>

- Flores
- IWSLT2017

</details>

<details open>
<summary><b>多语种问答</b></summary>

- TyDi-QA
- XCOPA

</details>

<details open>
<summary><b>多语种总结</b></summary>

- XLSum

</details>
</td>
<td>
<details open>
<summary><b>知识问答</b></summary>

- BoolQ
- CommonSenseQA
- NaturalQuestions
- TriviaQA

</details>
</td>
<td>
<details open>
<summary><b>文本蕴含</b></summary>

- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE
- ANLI

</details>

<details open>
<summary><b>常识推理</b></summary>

- StoryCloze
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA

</details>

<details open>
<summary><b>数学推理</b></summary>

- MATH
- GSM8K

</details>

<details open>
<summary><b>定理应用</b></summary>

- TheoremQA
- StrategyQA
- SciBench

</details>

<details open>
<summary><b>综合推理</b></summary>

- BBH

</details>
</td>
<td>
<details open>
<summary><b>初中/高中/大学/职业考试</b></summary>

- C-Eval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
- Xiezhi

</details>

<details open>
<summary><b>医学考试</b></summary>

- CMB

</details>
</td>
</tr>
</td>
</tr>
</tbody>
<tbody>
<tr align="center" valign="bottom">
<td>
<b>理解</b>
</td>
<td>
<b>长文本</b>
</td>
<td>
<b>安全</b>
</td>
<td>
<b>代码</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>阅读理解</b></summary>

- C3
- CMRC
- DRCD
- MultiRC
- RACE
- DROP
- OpenBookQA
- SQuAD2.0

</details>

<details open>
<summary><b>内容总结</b></summary>

- CSL
- LCSTS
- XSum
- SummScreen

</details>

<details open>
<summary><b>内容分析</b></summary>

- EPRSTMT
- LAMBADA
- TNEWS

</details>
</td>
<td>
<details open>
<summary><b>长文本理解</b></summary>

- LEval
- LongBench
- GovReports
- NarrativeQA
- Qasper

</details>
</td>
<td>
<details open>
<summary><b>安全</b></summary>

- CivilComments
- CrowsPairs
- CValues
- JigsawMultilingual
- TruthfulQA

</details>
<details open>
<summary><b>健壮性</b></summary>

- AdvGLUE

</details>
</td>
<td>
<details open>
<summary><b>代码</b></summary>

- HumanEval
- HumanEvalX
- MBPP
- APPs
- DS1000

</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>

<p align="right"><a href="#top">🔝返回顶部</a></p>

## 📖 模型支持

<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>开源模型</b>
</td>
<td>
<b>API 模型</b>
</td>
<!-- <td>
<b>自定义模型</b>
</td> -->
</tr>
<tr valign="top">
<td>

- InternLM
- LLaMA
- Vicuna
- Alpaca
- Baichuan
- WizardLM
- ChatGLM2
- Falcon
- TigerBot
- Qwen
- ……

</td>
<td>

- OpenAI
- Claude
- PaLM (即将推出)
- ……

</td>

</tr>
</tbody>
</table>

<p align="right"><a href="#top">🔝返回顶部</a></p>

## 🔜 路线图

- [ ] 主观评测