From d6261e109d5dd6babdae893c9472d27fc8d71d83 Mon Sep 17 00:00:00 2001 From: Leymore Date: Wed, 27 Sep 2023 15:02:09 +0800 Subject: [PATCH] [Doc] Update dataset list (#437) * add new dataset list * add new dataset list * add new dataset list * update * update * update readme --------- Co-authored-by: gaotongxiao --- README.md | 558 +++++++++++++++++++++++++++--------------------- README_zh-CN.md | 556 ++++++++++++++++++++++++++--------------------- 2 files changed, 628 insertions(+), 486 deletions(-) diff --git a/README.md b/README.md index 64ac4b13..6fbf1011 100644 --- a/README.md +++ b/README.md @@ -34,9 +34,10 @@ Just like a compass guides us on our journey, OpenCompass will guide you through ## 🚀 What's New +- **\[2023.09.26\]** We update the leaderboard with [Qwen](https://github.com/QwenLM/Qwen), one of the best-performing open-source models currently available, welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥. - **\[2023.09.20\]** We update the leaderboard with [InternLM-20B](https://github.com/InternLM/InternLM), welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥. -- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥. -- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md). 🔥🔥🔥. +- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our [homepage](https://opencompass.org.cn) for more details. +- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md). - **\[2023.09.08\]** We update the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5, welcome to our [homepage](https://opencompass.org.cn) for more details. - **\[2023.09.06\]** [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) team adpots OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation. - **\[2023.09.02\]** We have supported the evaluation of [Qwen-VL](https://github.com/QwenLM/Qwen-VL) in OpenCompass. @@ -51,7 +52,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features includes: -- **Comprehensive support for models and datasets**: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 50+ datasets with about 300,000 questions, comprehensively evaluating the capabilities of the models in five dimensions. +- **Comprehensive support for models and datasets**: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 70+ datasets with about 400,000 questions, comprehensively evaluating the capabilities of the models in five dimensions. - **Efficient distributed evaluation**: One line command to implement task division and distributed evaluation, completing the full evaluation of billion-scale models in just a few hours. @@ -67,247 +68,6 @@ We provide [OpenCompass Leaderbaord](https://opencompass.org.cn/rank) for commun

🔝Back to top

-## 📖 Dataset Support - - - - - - - - - - - - - - - - - - - - -
- Language - - Knowledge - - Reasoning - - Comprehensive Examination - - Understanding -
-
-Word Definition - -- WiC -- SummEdits - -
- -
-Idiom Learning - -- CHID - -
- -
-Semantic Similarity - -- AFQMC -- BUSTM - -
- -
-Coreference Resolution - -- CLUEWSC -- WSC -- WinoGrande - -
- -
-Translation - -- Flores - -
-
-
-Knowledge Question Answering - -- BoolQ -- CommonSenseQA -- NaturalQuestion -- TrivialQA - -
- -
-Multi-language Question Answering - -- TyDi-QA - -
-
-
-Textual Entailment - -- CMNLI -- OCNLI -- OCNLI_FC -- AX-b -- AX-g -- CB -- RTE - -
- -
-Commonsense Reasoning - -- StoryCloze -- StoryCloze-CN (coming soon) -- COPA -- ReCoRD -- HellaSwag -- PIQA -- SIQA - -
- -
-Mathematical Reasoning - -- MATH -- GSM8K - -
- -
-Theorem Application - -- TheoremQA - -
- -
-Code - -- HumanEval -- MBPP - -
- -
-Comprehensive Reasoning - -- BBH - -
-
-
-Junior High, High School, University, Professional Examinations - -- GAOKAO-2023 -- CEval -- AGIEval -- MMLU -- GAOKAO-Bench -- CMMLU -- ARC - -
-
-
-Reading Comprehension - -- C3 -- CMRC -- DRCD -- MultiRC -- RACE - -
- -
-Content Summary - -- CSL -- LCSTS -- XSum - -
- -
-Content Analysis - -- EPRSTMT -- LAMBADA -- TNEWS - -
-
- -

🔝Back to top

- -## 📖 Model Support - - - - - - - - - - - - - - - - -
- Open-source Models - - API Models -
- -- InternLM -- LLaMA -- Vicuna -- Alpaca -- Baichuan -- WizardLM -- ChatGLM-6B -- ChatGLM2-6B -- MPT -- Falcon -- TigerBot -- MOSS -- ... - - - -- OpenAI -- Claude (coming soon) -- PaLM (coming soon) -- …… - -
- ## 🛠️ Installation Below are the steps for quick installation and datasets preparation. @@ -360,6 +120,316 @@ python run.py --datasets ceval_ppl mmlu_ppl \ Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started.html) to learn how to run an evaluation task. +
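As a reference for the configuration-file route mentioned above, a minimal evaluation config might look like the sketch below. The dataset module path and the `huggyllama/llama-7b` checkpoint are illustrative assumptions, and the exact layout of the bundled configs may differ between versions, so treat this as a starting point rather than the canonical recipe.

```python
# Minimal sketch of an OpenCompass evaluation config.
# The dataset module path and the model checkpoint below are assumptions.
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    # Reuse a dataset definition shipped with the repo (path may vary by version).
    from .datasets.ceval.ceval_ppl import ceval_datasets

datasets = [*ceval_datasets]

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-7b-hf',                   # short name shown in result tables
        path='huggyllama/llama-7b',           # HuggingFace model id (example)
        tokenizer_path='huggyllama/llama-7b',
        max_out_len=100,                      # maximum number of generated tokens
        max_seq_len=2048,                     # maximum input sequence length
        batch_size=8,
        run_cfg=dict(num_gpus=1),             # GPUs required per inference task
    ),
]
```

Saved under `configs/`, such a file would be launched with `python run.py configs/<this_file>.py`, mirroring the command-line example above.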

🔝Back to top

+ +## 📖 Dataset Support + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Language + + Knowledge + + Reasoning + + Examination +
+
+Word Definition + +- WiC +- SummEdits + +
+ +
+Idiom Learning + +- CHID + +
+ +
+Semantic Similarity + +- AFQMC +- BUSTM + +
+ +
+Coreference Resolution + +- CLUEWSC +- WSC +- WinoGrande + +
+ +
+Translation + +- Flores +- IWSLT2017 + +
+ +
+Multi-language Question Answering + +- TyDi-QA +- XCOPA + +
+ +
+Multi-language Summary + +- XLSum + +
+
+
+Knowledge Question Answering + +- BoolQ +- CommonSenseQA +- NaturalQuestions +- TriviaQA + +
+
+
+Textual Entailment + +- CMNLI +- OCNLI +- OCNLI_FC +- AX-b +- AX-g +- CB +- RTE +- ANLI + +
+ +
+Commonsense Reasoning + +- StoryCloze +- COPA +- ReCoRD +- HellaSwag +- PIQA +- SIQA + +
+ +
+Mathematical Reasoning + +- MATH +- GSM8K + +
+ +
+Theorem Application + +- TheoremQA +- StrategyQA +- SciBench + +
+ +
+Comprehensive Reasoning + +- BBH + +
+
+
+Junior High, High School, University, Professional Examinations + +- C-Eval +- AGIEval +- MMLU +- GAOKAO-Bench +- CMMLU +- ARC +- Xiezhi + +
+ +
+Medical Examinations + +- CMB + +
+
+ Understanding + + Long Context + + Safety + + Code +
+
+Reading Comprehension + +- C3 +- CMRC +- DRCD +- MultiRC +- RACE +- DROP +- OpenBookQA +- SQuAD2.0 + +
+ +
+Content Summary + +- CSL +- LCSTS +- XSum +- SummScreen + +
+ +
+Content Analysis + +- EPRSTMT +- LAMBADA +- TNEWS + +
+
+
+Long Context Understanding + +- LEval +- LongBench +- GovReports +- NarrativeQA +- Qasper + +
+
+
+Safety + +- CivilComments +- CrowsPairs +- CValues +- JigsawMultilingual +- TruthfulQA + +
+
+Robustness + +- AdvGLUE + +
+
+
+Code + +- HumanEval +- HumanEvalX +- MBPP +- APPS +- DS1000 + +
+
+ +

🔝Back to top

+ +## 📖 Model Support + + + + + + + + + + + + + + +
+ Open-source Models + + API Models +
+ +- InternLM +- LLaMA +- Vicuna +- Alpaca +- Baichuan +- WizardLM +- ChatGLM2 +- Falcon +- TigerBot +- Qwen +- ... + + + +- OpenAI +- Claude +- PaLM (coming soon) +- …… + +
+ +

🔝Back to top

+ ## 🔜 Roadmap - [ ] Subjective Evaluation diff --git a/README_zh-CN.md b/README_zh-CN.md index be833994..74f06505 100644 --- a/README_zh-CN.md +++ b/README_zh-CN.md @@ -34,9 +34,10 @@ ## 🚀 最新进展 +- **\[2023.09.26\]** 我们在评测榜单上更新了[Qwen](https://github.com/QwenLM/Qwen), 这是目前表现最好的开源模型之一, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥. - **\[2023.09.20\]** 我们在评测榜单上更新了[InternLM-20B](https://github.com/InternLM/InternLM), 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥. -- **\[2023.09.19\]** 我们在评测榜单上更新了WeMix-LLaMA2-70B/Phi-1.5-1.3B, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥. -- **\[2023.09.18\]** 我们发布了[长文本评测指引](docs/zh_cn/advanced_guides/longeval.md).🔥🔥🔥. +- **\[2023.09.19\]** 我们在评测榜单上更新了WeMix-LLaMA2-70B/Phi-1.5-1.3B, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情. +- **\[2023.09.18\]** 我们发布了[长文本评测指引](docs/zh_cn/advanced_guides/longeval.md). - **\[2023.09.08\]** 我们在评测榜单上更新了Baichuan-2/Tigerbot-2/Vicuna-v1.5, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情。 - **\[2023.09.06\]** 欢迎 [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) 团队采用OpenCompass对模型进行系统评估。我们非常感谢社区在提升LLM评估的透明度和可复现性上所做的努力。 - **\[2023.09.02\]** 我们加入了[Qwen-VL](https://github.com/QwenLM/Qwen-VL)的评测支持。 @@ -53,7 +54,7 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下 - **开源可复现**:提供公平、公开、可复现的大模型评测方案 -- **全面的能力维度**:五大维度设计,提供 50+ 个数据集约 30 万题的的模型评测方案,全面评估模型能力 +- **全面的能力维度**:五大维度设计,提供 70+ 个数据集约 40 万题的的模型评测方案,全面评估模型能力 - **丰富的模型支持**:已支持 20+ HuggingFace 及 API 模型 @@ -69,245 +70,6 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下

🔝返回顶部

-## 📖 数据集支持 - - - - - - - - - - - - - - - - - - - - -
- 语言 - - 知识 - - 推理 - - 学科 - - 理解 -
-
-字词释义 - -- WiC -- SummEdits - -
- -
-成语习语 - -- CHID - -
- -
-语义相似度 - -- AFQMC -- BUSTM - -
- -
-指代消解 - -- CLUEWSC -- WSC -- WinoGrande - -
- -
-翻译 - -- Flores - -
-
-
-知识问答 - -- BoolQ -- CommonSenseQA -- NaturalQuestion -- TrivialQA - -
- -
-多语种问答 - -- TyDi-QA - -
-
-
-文本蕴含 - -- CMNLI -- OCNLI -- OCNLI_FC -- AX-b -- AX-g -- CB -- RTE - -
- -
-常识推理 - -- StoryCloze -- StoryCloze-CN(即将上线) -- COPA -- ReCoRD -- HellaSwag -- PIQA -- SIQA - -
- -
-数学推理 - -- MATH -- GSM8K - -
- -
-定理应用 - -- TheoremQA - -
- -
-代码 - -- HumanEval -- MBPP - -
- -
-综合推理 - -- BBH - -
-
-
-初中/高中/大学/职业考试 - -- GAOKAO-2023 -- CEval -- AGIEval -- MMLU -- GAOKAO-Bench -- CMMLU -- ARC - -
-
-
-阅读理解 - -- C3 -- CMRC -- DRCD -- MultiRC -- RACE - -
- -
-内容总结 - -- CSL -- LCSTS -- XSum - -
- -
-内容分析 - -- EPRSTMT -- LAMBADA -- TNEWS - -
-
- -

🔝返回顶部

- -## 📖 模型支持 - - - - - - - - - - - - - - -
- 开源模型 - - API 模型 -
- -- LLaMA -- Vicuna -- Alpaca -- Baichuan -- WizardLM -- ChatGLM-6B -- ChatGLM2-6B -- MPT -- Falcon -- TigerBot -- MOSS -- …… - - - -- OpenAI -- Claude (即将推出) -- PaLM (即将推出) -- …… - -
- ## 🛠️ 安装 下面展示了快速安装以及准备数据集的步骤。 @@ -362,6 +124,316 @@ python run.py --datasets ceval_ppl mmlu_ppl \ 更多教程请查看我们的[文档](https://opencompass.readthedocs.io/zh_CN/latest/index.html)。 +
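作为上文提到的“配置文件”评测方式的参考,下面给出一个最小化的配置示意。其中的数据集配置路径与 `huggyllama/llama-7b` 模型均为示例性假设,不同版本自带配置的目录结构可能有所不同,请以实际文档为准。

```python
# OpenCompass 评测配置的最小示意(数据集路径与模型检查点均为示例性假设)
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    # 复用仓库自带的数据集配置(模块路径可能随版本变化)
    from .datasets.ceval.ceval_ppl import ceval_datasets

datasets = [*ceval_datasets]

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-7b-hf',                   # 结果表中显示的模型简称
        path='huggyllama/llama-7b',           # HuggingFace 模型 id(示例)
        tokenizer_path='huggyllama/llama-7b',
        max_out_len=100,                      # 最大生成 token 数
        max_seq_len=2048,                     # 最大输入序列长度
        batch_size=8,
        run_cfg=dict(num_gpus=1),             # 每个推理任务所需 GPU 数
    ),
]
```

将其保存到 `configs/` 目录后,可通过 `python run.py configs/<该文件>.py` 启动评测,与上面的命令行示例等价。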

🔝返回顶部

+ +## 📖 数据集支持 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ 语言 + + 知识 + + 推理 + + 考试 +
+
+字词释义 + +- WiC +- SummEdits + +
+ +
+成语习语 + +- CHID + +
+ +
+语义相似度 + +- AFQMC +- BUSTM + +
+ +
+指代消解 + +- CLUEWSC +- WSC +- WinoGrande + +
+ +
+翻译 + +- Flores +- IWSLT2017 + +
+ +
+多语种问答 + +- TyDi-QA +- XCOPA + +
+ +
+多语种总结 + +- XLSum + +
+
+
+知识问答 + +- BoolQ +- CommonSenseQA +- NaturalQuestions +- TriviaQA + +
+
+
+文本蕴含 + +- CMNLI +- OCNLI +- OCNLI_FC +- AX-b +- AX-g +- CB +- RTE +- ANLI + +
+ +
+常识推理 + +- StoryCloze +- COPA +- ReCoRD +- HellaSwag +- PIQA +- SIQA + +
+ +
+数学推理 + +- MATH +- GSM8K + +
+ +
+定理应用 + +- TheoremQA +- StrategyQA +- SciBench + +
+ +
+综合推理 + +- BBH + +
+
+
+初中/高中/大学/职业考试 + +- C-Eval +- AGIEval +- MMLU +- GAOKAO-Bench +- CMMLU +- ARC +- Xiezhi + +
+ +
+医学考试 + +- CMB + +
+
+ 理解 + + 长文本 + + 安全 + + 代码 +
+
+阅读理解 + +- C3 +- CMRC +- DRCD +- MultiRC +- RACE +- DROP +- OpenBookQA +- SQuAD2.0 + +
+ +
+内容总结 + +- CSL +- LCSTS +- XSum +- SummScreen + +
+ +
+内容分析 + +- EPRSTMT +- LAMBADA +- TNEWS + +
+
+
+长文本理解 + +- LEval +- LongBench +- GovReports +- NarrativeQA +- Qasper + +
+
+
+安全 + +- CivilComments +- CrowsPairs +- CValues +- JigsawMultilingual +- TruthfulQA + +
+
+健壮性 + +- AdvGLUE + +
+
+
+代码 + +- HumanEval +- HumanEvalX +- MBPP +- APPS +- DS1000 + +
+
+ +

🔝返回顶部

+ +## 📖 模型支持 + + + + + + + + + + + + + + +
+ 开源模型 + + API 模型 +
+ +- InternLM +- LLaMA +- Vicuna +- Alpaca +- Baichuan +- WizardLM +- ChatGLM2 +- Falcon +- TigerBot +- Qwen +- …… + + + +- OpenAI +- Claude +- PaLM (即将推出) +- …… + +
+ +

🔝返回顶部

+ ## 🔜 路线图 - [ ] 主观评测