[Doc] Update dataset list (#437)

* add new dataset list
* add new dataset list
* add new dataset list
* update
* update
* update readme

---------

Co-authored-by: gaotongxiao <gaotongxiao@gmail.com>
2025-05-30 16:03:24 +08:00 · 2023-09-27 15:02:09 +08:00 · 2023-09-27 15:02:09 +08:00 · d6261e109d
commit d6261e109d
parent dc1b82c346
2 changed files with 628 additions and 486 deletions
--- a/README.md
+++ b/README.md
@@ -34,9 +34,10 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
 ## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
 - **\[2023.09.26\]** We updated the leaderboard with [Qwen](https://github.com/QwenLM/Qwen), one of the best-performing open-source models currently available. Visit our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥
 - **\[2023.09.20\]** We updated the leaderboard with [InternLM-20B](https://github.com/InternLM/InternLM). Visit our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥
-- **\[2023.09.19\]** We updated the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B. Visit our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥
+- **\[2023.09.19\]** We updated the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B. Visit our [homepage](https://opencompass.org.cn) for more details.
-- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md). 🔥🔥🔥
+- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md).
 - **\[2023.09.08\]** We updated the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5. Visit our [homepage](https://opencompass.org.cn) for more details.
 - **\[2023.09.06\]** The [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) team adopts OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation.
 - **\[2023.09.02\]** We have supported the evaluation of [Qwen-VL](https://github.com/QwenLM/Qwen-VL) in OpenCompass.
@@ -51,7 +52,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
 OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features include:
-- **Comprehensive support for models and datasets**: Built-in support for 20+ HuggingFace and API models, plus an evaluation scheme of 50+ datasets with about 300,000 questions that comprehensively evaluates model capabilities in five dimensions.
+- **Comprehensive support for models and datasets**: Built-in support for 20+ HuggingFace and API models, plus an evaluation scheme of 70+ datasets with about 400,000 questions that comprehensively evaluates model capabilities in five dimensions.
 - **Efficient distributed evaluation**: A one-line command handles task division and distributed evaluation, completing the full evaluation of billion-scale models in just a few hours.
@@ -67,247 +68,6 @@ We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for commun
 <p align="right"><a href="#top">🔝Back to top</a></p>
 ## 📖 Dataset Support
 <table align="center">
  <tbody>
    <tr align="center" valign="bottom">
      <td>
        <b>Language</b>
      </td>
      <td>
        <b>Knowledge</b>
      </td>
      <td>
        <b>Reasoning</b>
      </td>
      <td>
        <b>Comprehensive Examination</b>
      </td>
      <td>
        <b>Understanding</b>
      </td>
    </tr>
    <tr valign="top">
      <td>
 <details open>
 <summary><b>Word Definition</b></summary>
 - WiC
 - SummEdits
 </details>
 <details open>
 <summary><b>Idiom Learning</b></summary>
 - CHID
 </details>
 <details open>
 <summary><b>Semantic Similarity</b></summary>
 - AFQMC
 - BUSTM
 </details>
 <details open>
 <summary><b>Coreference Resolution</b></summary>
 - CLUEWSC
 - WSC
 - WinoGrande
 </details>
 <details open>
 <summary><b>Translation</b></summary>
 - Flores
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Knowledge Question Answering</b></summary>
 - BoolQ
 - CommonSenseQA
 - NaturalQuestion
 - TrivialQA
 </details>
 <details open>
 <summary><b>Multi-language Question Answering</b></summary>
 - TyDi-QA
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Textual Entailment</b></summary>
 - CMNLI
 - OCNLI
 - OCNLI_FC
 - AX-b
 - AX-g
 - CB
 - RTE
 </details>
 <details open>
 <summary><b>Commonsense Reasoning</b></summary>
 - StoryCloze
 - StoryCloze-CN (coming soon)
 - COPA
 - ReCoRD
 - HellaSwag
 - PIQA
 - SIQA
 </details>
 <details open>
 <summary><b>Mathematical Reasoning</b></summary>
 - MATH
 - GSM8K
 </details>
 <details open>
 <summary><b>Theorem Application</b></summary>
 - TheoremQA
 </details>
 <details open>
 <summary><b>Code</b></summary>
 - HumanEval
 - MBPP
 </details>
 <details open>
 <summary><b>Comprehensive Reasoning</b></summary>
 - BBH
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Junior High, High School, University, Professional Examinations</b></summary>
 - GAOKAO-2023
 - CEval
 - AGIEval
 - MMLU
 - GAOKAO-Bench
 - CMMLU
 - ARC
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Reading Comprehension</b></summary>
 - C3
 - CMRC
 - DRCD
 - MultiRC
 - RACE
 </details>
 <details open>
 <summary><b>Content Summary</b></summary>
 - CSL
 - LCSTS
 - XSum
 </details>
 <details open>
 <summary><b>Content Analysis</b></summary>
 - EPRSTMT
 - LAMBADA
 - TNEWS
 </details>
      </td>
    </tr>
 </td>
    </tr>
  </tbody>
 </table>
 <p align="right"><a href="#top">🔝Back to top</a></p>
 ## 📖 Model Support
 <table align="center">
  <tbody>
    <tr align="center" valign="bottom">
      <td>
        <b>Open-source Models</b>
      </td>
      <td>
        <b>API Models</b>
      </td>
      <!-- <td>
        <b>Custom Models</b>
      </td> -->
    </tr>
    <tr valign="top">
      <td>
 - InternLM
 - LLaMA
 - Vicuna
 - Alpaca
 - Baichuan
 - WizardLM
 - ChatGLM-6B
 - ChatGLM2-6B
 - MPT
 - Falcon
 - TigerBot
 - MOSS
 - ...
 </td>
 <td>
 - OpenAI
 - Claude (coming soon)
 - PaLM (coming soon)
 - ……
 </td>
 <!--
 - GLM
 - ...
 </td> -->
 </tr>
  </tbody>
 </table>
 ## 🛠️ Installation
 Below are the steps for quick installation and dataset preparation.
@@ -360,6 +120,316 @@ python run.py --datasets ceval_ppl mmlu_ppl \
 Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started.html) to learn how to run an evaluation task.
 <p align="right"><a href="#top">🔝Back to top</a></p>
 ## 📖 Dataset Support
 <table align="center">
  <tbody>
    <tr align="center" valign="bottom">
      <td>
        <b>Language</b>
      </td>
      <td>
        <b>Knowledge</b>
      </td>
      <td>
        <b>Reasoning</b>
      </td>
      <td>
        <b>Examination</b>
      </td>
    </tr>
    <tr valign="top">
      <td>
 <details open>
 <summary><b>Word Definition</b></summary>
 - WiC
 - SummEdits
 </details>
 <details open>
 <summary><b>Idiom Learning</b></summary>
 - CHID
 </details>
 <details open>
 <summary><b>Semantic Similarity</b></summary>
 - AFQMC
 - BUSTM
 </details>
 <details open>
 <summary><b>Coreference Resolution</b></summary>
 - CLUEWSC
 - WSC
 - WinoGrande
 </details>
 <details open>
 <summary><b>Translation</b></summary>
 - Flores
 - IWSLT2017
 </details>
 <details open>
 <summary><b>Multi-language Question Answering</b></summary>
 - TyDi-QA
 - XCOPA
 </details>
 <details open>
 <summary><b>Multi-language Summary</b></summary>
 - XLSum
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Knowledge Question Answering</b></summary>
 - BoolQ
 - CommonSenseQA
 - NaturalQuestions
 - TriviaQA
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Textual Entailment</b></summary>
 - CMNLI
 - OCNLI
 - OCNLI_FC
 - AX-b
 - AX-g
 - CB
 - RTE
 - ANLI
 </details>
 <details open>
 <summary><b>Commonsense Reasoning</b></summary>
 - StoryCloze
 - COPA
 - ReCoRD
 - HellaSwag
 - PIQA
 - SIQA
 </details>
 <details open>
 <summary><b>Mathematical Reasoning</b></summary>
 - MATH
 - GSM8K
 </details>
 <details open>
 <summary><b>Theorem Application</b></summary>
 - TheoremQA
 - StrategyQA
 - SciBench
 </details>
 <details open>
 <summary><b>Comprehensive Reasoning</b></summary>
 - BBH
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Junior High, High School, University, Professional Examinations</b></summary>
 - C-Eval
 - AGIEval
 - MMLU
 - GAOKAO-Bench
 - CMMLU
 - ARC
 - Xiezhi
 </details>
 <details open>
 <summary><b>Medical Examinations</b></summary>
 - CMB
 </details>
      </td>
    </tr>
 </td>
    </tr>
  </tbody>
  <tbody>
    <tr align="center" valign="bottom">
      <td>
        <b>Understanding</b>
      </td>
      <td>
        <b>Long Context</b>
      </td>
      <td>
        <b>Safety</b>
      </td>
      <td>
        <b>Code</b>
      </td>
    </tr>
    <tr valign="top">
      <td>
 <details open>
 <summary><b>Reading Comprehension</b></summary>
 - C3
 - CMRC
 - DRCD
 - MultiRC
 - RACE
 - DROP
 - OpenBookQA
 - SQuAD2.0
 </details>
 <details open>
 <summary><b>Content Summary</b></summary>
 - CSL
 - LCSTS
 - XSum
 - SummScreen
 </details>
 <details open>
 <summary><b>Content Analysis</b></summary>
 - EPRSTMT
 - LAMBADA
 - TNEWS
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Long Context Understanding</b></summary>
 - LEval
 - LongBench
 - GovReports
 - NarrativeQA
 - Qasper
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Safety</b></summary>
 - CivilComments
 - CrowsPairs
 - CValues
 - JigsawMultilingual
 - TruthfulQA
 </details>
 <details open>
 <summary><b>Robustness</b></summary>
 - AdvGLUE
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Code</b></summary>
 - HumanEval
 - HumanEvalX
 - MBPP
 - APPs
 - DS1000
 </details>
      </td>
    </tr>
 </td>
    </tr>
  </tbody>
 </table>
 <p align="right"><a href="#top">🔝Back to top</a></p>
 ## 📖 Model Support
 <table align="center">
  <tbody>
    <tr align="center" valign="bottom">
      <td>
        <b>Open-source Models</b>
      </td>
      <td>
        <b>API Models</b>
      </td>
      <!-- <td>
        <b>Custom Models</b>
      </td> -->
    </tr>
    <tr valign="top">
      <td>
 - InternLM
 - LLaMA
 - Vicuna
 - Alpaca
 - Baichuan
 - WizardLM
 - ChatGLM2
 - Falcon
 - TigerBot
 - Qwen
 - ...
 </td>
 <td>
 - OpenAI
 - Claude
 - PaLM (coming soon)
 - ……
 </td>
 </tr>
  </tbody>
 </table>
 <p align="right"><a href="#top">🔝Back to top</a></p>
 ## 🔜 Roadmap
 - [ ] Subjective Evaluation
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -34,9 +34,10 @@
 ## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
 - **\[2023.09.26\]** We updated the leaderboard with [Qwen](https://github.com/QwenLM/Qwen), one of the best-performing open-source models currently available. Visit the [official website](https://opencompass.org.cn) for details. 🔥🔥🔥
 - **\[2023.09.20\]** We updated the leaderboard with [InternLM-20B](https://github.com/InternLM/InternLM). Visit the [official website](https://opencompass.org.cn) for details. 🔥🔥🔥
-- **\[2023.09.19\]** We updated the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B. Visit the [official website](https://opencompass.org.cn) for details. 🔥🔥🔥
+- **\[2023.09.19\]** We updated the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B. Visit the [official website](https://opencompass.org.cn) for details.
-- **\[2023.09.18\]** We released the [long context evaluation guide](docs/zh_cn/advanced_guides/longeval.md). 🔥🔥🔥
+- **\[2023.09.18\]** We released the [long context evaluation guide](docs/zh_cn/advanced_guides/longeval.md).
 - **\[2023.09.08\]** We updated the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5. Visit the [official website](https://opencompass.org.cn) for details.
 - **\[2023.09.06\]** We welcome the [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) team's adoption of OpenCompass to evaluate their models systematically. We deeply appreciate the community's efforts to improve the transparency and reproducibility of LLM evaluation.
 - **\[2023.09.02\]** We have added evaluation support for [Qwen-VL](https://github.com/QwenLM/Qwen-VL).
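@@ -53,7 +54,7 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows
 - **Open-source and reproducible**: Provides a fair, open, and reproducible evaluation scheme for large models
-- **Comprehensive capability dimensions**: A five-dimension design offering an evaluation scheme of 50+ datasets with about 300,000 questions, comprehensively evaluating model capabilities
+- **Comprehensive capability dimensions**: A five-dimension design offering an evaluation scheme of 70+ datasets with about 400,000 questions, comprehensively evaluating model capabilities
 - **Rich model support**: Supports 20+ HuggingFace and API models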
@@ -69,245 +70,6 @@ OpenCompass is a one-stop platform for large model evaluation. Its main features are as follows
 <p align="right"><a href="#top">🔝Back to top</a></p>
 ## 📖 Dataset Support
 <table align="center">
  <tbody>
    <tr align="center" valign="bottom">
      <td>
        <b>Language</b>
      </td>
      <td>
        <b>Knowledge</b>
      </td>
      <td>
        <b>Reasoning</b>
      </td>
      <td>
        <b>Subjects</b>
      </td>
      <td>
        <b>Understanding</b>
      </td>
    </tr>
    <tr valign="top">
      <td>
 <details open>
 <summary><b>Word Definition</b></summary>
 - WiC
 - SummEdits
 </details>
 <details open>
 <summary><b>Idiom Learning</b></summary>
 - CHID
 </details>
 <details open>
 <summary><b>Semantic Similarity</b></summary>
 - AFQMC
 - BUSTM
 </details>
 <details open>
 <summary><b>Coreference Resolution</b></summary>
 - CLUEWSC
 - WSC
 - WinoGrande
 </details>
 <details open>
 <summary><b>Translation</b></summary>
 - Flores
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Knowledge Question Answering</b></summary>
 - BoolQ
 - CommonSenseQA
 - NaturalQuestion
 - TrivialQA
 </details>
 <details open>
 <summary><b>Multi-language Question Answering</b></summary>
 - TyDi-QA
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Textual Entailment</b></summary>
 - CMNLI
 - OCNLI
 - OCNLI_FC
 - AX-b
 - AX-g
 - CB
 - RTE
 </details>
 <details open>
 <summary><b>Commonsense Reasoning</b></summary>
 - StoryCloze
 - StoryCloze-CN (coming soon)
 - COPA
 - ReCoRD
 - HellaSwag
 - PIQA
 - SIQA
 </details>
 <details open>
 <summary><b>Mathematical Reasoning</b></summary>
 - MATH
 - GSM8K
 </details>
 <details open>
 <summary><b>Theorem Application</b></summary>
 - TheoremQA
 </details>
 <details open>
 <summary><b>Code</b></summary>
 - HumanEval
 - MBPP
 </details>
 <details open>
 <summary><b>Comprehensive Reasoning</b></summary>
 - BBH
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Junior High, High School, University, Professional Examinations</b></summary>
 - GAOKAO-2023
 - CEval
 - AGIEval
 - MMLU
 - GAOKAO-Bench
 - CMMLU
 - ARC
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Reading Comprehension</b></summary>
 - C3
 - CMRC
 - DRCD
 - MultiRC
 - RACE
 </details>
 <details open>
 <summary><b>Content Summary</b></summary>
 - CSL
 - LCSTS
 - XSum
 </details>
 <details open>
 <summary><b>Content Analysis</b></summary>
 - EPRSTMT
 - LAMBADA
 - TNEWS
 </details>
      </td>
    </tr>
 </td>
    </tr>
  </tbody>
 </table>
 <p align="right"><a href="#top">🔝Back to top</a></p>
 ## 📖 Model Support
 <table align="center">
  <tbody>
    <tr align="center" valign="bottom">
      <td>
        <b>Open-source Models</b>
      </td>
      <td>
        <b>API Models</b>
      </td>
      <!-- <td>
        <b>Custom Models</b>
      </td> -->
    </tr>
    <tr valign="top">
      <td>
 - LLaMA
 - Vicuna
 - Alpaca
 - Baichuan
 - WizardLM
 - ChatGLM-6B
 - ChatGLM2-6B
 - MPT
 - Falcon
 - TigerBot
 - MOSS
 - ……
 </td>
 <td>
 - OpenAI
 - Claude (coming soon)
 - PaLM (coming soon)
 - ……
 </td>
 <!-- <td>
 - GLM
 - ……
 </td> -->
 </tr>
  </tbody>
 </table>
 ## 🛠️ Installation
 Below are the steps for quick installation and dataset preparation.
@@ -362,6 +124,316 @@ python run.py --datasets ceval_ppl mmlu_ppl \
 For more tutorials, please see our [documentation](https://opencompass.readthedocs.io/zh_CN/latest/index.html).
 <p align="right"><a href="#top">🔝Back to top</a></p>
 ## 📖 Dataset Support
 <table align="center">
  <tbody>
    <tr align="center" valign="bottom">
      <td>
        <b>Language</b>
      </td>
      <td>
        <b>Knowledge</b>
      </td>
      <td>
        <b>Reasoning</b>
      </td>
      <td>
        <b>Examination</b>
      </td>
    </tr>
    <tr valign="top">
      <td>
 <details open>
 <summary><b>Word Definition</b></summary>
 - WiC
 - SummEdits
 </details>
 <details open>
 <summary><b>Idiom Learning</b></summary>
 - CHID
 </details>
 <details open>
 <summary><b>Semantic Similarity</b></summary>
 - AFQMC
 - BUSTM
 </details>
 <details open>
 <summary><b>Coreference Resolution</b></summary>
 - CLUEWSC
 - WSC
 - WinoGrande
 </details>
 <details open>
 <summary><b>Translation</b></summary>
 - Flores
 - IWSLT2017
 </details>
 <details open>
 <summary><b>Multi-language Question Answering</b></summary>
 - TyDi-QA
 - XCOPA
 </details>
 <details open>
 <summary><b>Multi-language Summary</b></summary>
 - XLSum
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Knowledge Question Answering</b></summary>
 - BoolQ
 - CommonSenseQA
 - NaturalQuestions
 - TriviaQA
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Textual Entailment</b></summary>
 - CMNLI
 - OCNLI
 - OCNLI_FC
 - AX-b
 - AX-g
 - CB
 - RTE
 - ANLI
 </details>
 <details open>
 <summary><b>Commonsense Reasoning</b></summary>
 - StoryCloze
 - COPA
 - ReCoRD
 - HellaSwag
 - PIQA
 - SIQA
 </details>
 <details open>
 <summary><b>Mathematical Reasoning</b></summary>
 - MATH
 - GSM8K
 </details>
 <details open>
 <summary><b>Theorem Application</b></summary>
 - TheoremQA
 - StrategyQA
 - SciBench
 </details>
 <details open>
 <summary><b>Comprehensive Reasoning</b></summary>
 - BBH
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Junior High, High School, University, Professional Examinations</b></summary>
 - C-Eval
 - AGIEval
 - MMLU
 - GAOKAO-Bench
 - CMMLU
 - ARC
 - Xiezhi
 </details>
 <details open>
 <summary><b>Medical Examinations</b></summary>
 - CMB
 </details>
      </td>
    </tr>
 </td>
    </tr>
  </tbody>
  <tbody>
    <tr align="center" valign="bottom">
      <td>
        <b>Understanding</b>
      </td>
      <td>
        <b>Long Context</b>
      </td>
      <td>
        <b>Safety</b>
      </td>
      <td>
        <b>Code</b>
      </td>
    </tr>
    <tr valign="top">
      <td>
 <details open>
 <summary><b>Reading Comprehension</b></summary>
 - C3
 - CMRC
 - DRCD
 - MultiRC
 - RACE
 - DROP
 - OpenBookQA
 - SQuAD2.0
 </details>
 <details open>
 <summary><b>Content Summary</b></summary>
 - CSL
 - LCSTS
 - XSum
 - SummScreen
 </details>
 <details open>
 <summary><b>Content Analysis</b></summary>
 - EPRSTMT
 - LAMBADA
 - TNEWS
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Long Context Understanding</b></summary>
 - LEval
 - LongBench
 - GovReports
 - NarrativeQA
 - Qasper
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Safety</b></summary>
 - CivilComments
 - CrowsPairs
 - CValues
 - JigsawMultilingual
 - TruthfulQA
 </details>
 <details open>
 <summary><b>Robustness</b></summary>
 - AdvGLUE
 </details>
      </td>
      <td>
 <details open>
 <summary><b>Code</b></summary>
 - HumanEval
 - HumanEvalX
 - MBPP
 - APPs
 - DS1000
 </details>
      </td>
    </tr>
 </td>
    </tr>
  </tbody>
 </table>
 <p align="right"><a href="#top">🔝Back to top</a></p>
 ## 📖 Model Support
 <table align="center">
  <tbody>
    <tr align="center" valign="bottom">
      <td>
        <b>Open-source Models</b>
      </td>
      <td>
        <b>API Models</b>
      </td>
      <!-- <td>
        <b>Custom Models</b>
      </td> -->
    </tr>
    <tr valign="top">
      <td>
 - InternLM
 - LLaMA
 - Vicuna
 - Alpaca
 - Baichuan
 - WizardLM
 - ChatGLM2
 - Falcon
 - TigerBot
 - Qwen
 - ……
 </td>
 <td>
 - OpenAI
 - Claude
 - PaLM (coming soon)
 - ……
 </td>
 </tr>
  </tbody>
 </table>
 <p align="right"><a href="#top">🔝Back to top</a></p>
 ## 🔜 Roadmap
 - [ ] Subjective Evaluation