From d6261e109d5dd6babdae893c9472d27fc8d71d83 Mon Sep 17 00:00:00 2001 From: Leymore Date: Wed, 27 Sep 2023 15:02:09 +0800 Subject: [PATCH] [Doc] Update dataset list (#437) * add new dataset list * add new dataset list * add new dataset list * update * update * update readme --------- Co-authored-by: gaotongxiao --- README.md | 558 +++++++++++++++++++++++++++--------------------- README_zh-CN.md | 556 ++++++++++++++++++++++++++--------------------- 2 files changed, 628 insertions(+), 486 deletions(-) diff --git a/README.md b/README.md index 64ac4b13..6fbf1011 100644 --- a/README.md +++ b/README.md @@ -34,9 +34,10 @@ Just like a compass guides us on our journey, OpenCompass will guide you through ## 🚀 What's New +- **\[2023.09.26\]** We update the leaderboard with [Qwen](https://github.com/QwenLM/Qwen), one of the best-performing open-source models currently available, welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥. - **\[2023.09.20\]** We update the leaderboard with [InternLM-20B](https://github.com/InternLM/InternLM), welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥. -- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our [homepage](https://opencompass.org.cn) for more details. 🔥🔥🔥. -- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md). 🔥🔥🔥. +- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our [homepage](https://opencompass.org.cn) for more details. +- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md). - **\[2023.09.08\]** We update the leaderboard with Baichuan-2/Tigerbot-2/Vicuna-v1.5, welcome to our [homepage](https://opencompass.org.cn) for more details. - **\[2023.09.06\]** [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) team adpots OpenCompass to evaluate their models systematically. We deeply appreciate the community's dedication to transparency and reproducibility in LLM evaluation. - **\[2023.09.02\]** We have supported the evaluation of [Qwen-VL](https://github.com/QwenLM/Qwen-VL) in OpenCompass. @@ -51,7 +52,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through OpenCompass is a one-stop platform for large model evaluation, aiming to provide a fair, open, and reproducible benchmark for large model evaluation. Its main features includes: -- **Comprehensive support for models and datasets**: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 50+ datasets with about 300,000 questions, comprehensively evaluating the capabilities of the models in five dimensions. +- **Comprehensive support for models and datasets**: Pre-support for 20+ HuggingFace and API models, a model evaluation scheme of 70+ datasets with about 400,000 questions, comprehensively evaluating the capabilities of the models in five dimensions. - **Efficient distributed evaluation**: One line command to implement task division and distributed evaluation, completing the full evaluation of billion-scale models in just a few hours. @@ -67,247 +68,6 @@ We provide [OpenCompass Leaderbaord](https://opencompass.org.cn/rank) for commun

🔝Back to top

-## 📖 Dataset Support - - - - - - - - - - - - - - - - - - - - -
- Language - - Knowledge - - Reasoning - - Comprehensive Examination - - Understanding -
-
-Word Definition - -- WiC -- SummEdits - -
- -
-Idiom Learning - -- CHID - -
- -
-Semantic Similarity - -- AFQMC -- BUSTM - -
- -
-Coreference Resolution - -- CLUEWSC -- WSC -- WinoGrande - -
- -
-Translation - -- Flores - -
-
-
-Knowledge Question Answering - -- BoolQ -- CommonSenseQA -- NaturalQuestion -- TrivialQA - -
- -
-Multi-language Question Answering - -- TyDi-QA - -
-
-
-Textual Entailment - -- CMNLI -- OCNLI -- OCNLI_FC -- AX-b -- AX-g -- CB -- RTE - -
- -
-Commonsense Reasoning - -- StoryCloze -- StoryCloze-CN (coming soon) -- COPA -- ReCoRD -- HellaSwag -- PIQA -- SIQA - -
- -
-Mathematical Reasoning - -- MATH -- GSM8K - -
- -
-Theorem Application - -- TheoremQA - -
- -
-Code - -- HumanEval -- MBPP - -
- -
-Comprehensive Reasoning - -- BBH - -
-
-
-Junior High, High School, University, Professional Examinations - -- GAOKAO-2023 -- CEval -- AGIEval -- MMLU -- GAOKAO-Bench -- CMMLU -- ARC - -
-
-
-Reading Comprehension - -- C3 -- CMRC -- DRCD -- MultiRC -- RACE - -
- -
-Content Summary - -- CSL -- LCSTS -- XSum - -
- -
-Content Analysis - -- EPRSTMT -- LAMBADA -- TNEWS - -
-
- -

🔝Back to top

- -## 📖 Model Support - - - - - - - - - - - - - - - - -
- Open-source Models - - API Models -
- -- InternLM -- LLaMA -- Vicuna -- Alpaca -- Baichuan -- WizardLM -- ChatGLM-6B -- ChatGLM2-6B -- MPT -- Falcon -- TigerBot -- MOSS -- ... - - - -- OpenAI -- Claude (coming soon) -- PaLM (coming soon) -- …… - -
- ## 🛠️ Installation Below are the steps for quick installation and datasets preparation. @@ -360,6 +120,316 @@ python run.py --datasets ceval_ppl mmlu_ppl \ Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started.html) to learn how to run an evaluation task. +
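As a reference for the configuration-file route mentioned above, a minimal evaluation config might look like the sketch below. The dataset module path and the `huggyllama/llama-7b` checkpoint are illustrative assumptions, and the exact layout of the bundled configs may differ between versions, so treat this as a starting point rather than the canonical recipe.

```python
# Minimal sketch of an OpenCompass evaluation config.
# The dataset module path and the model checkpoint below are assumptions.
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    # Reuse a dataset definition shipped with the repo (path may vary by version).
    from .datasets.ceval.ceval_ppl import ceval_datasets

datasets = [*ceval_datasets]

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-7b-hf',                   # short name shown in result tables
        path='huggyllama/llama-7b',           # HuggingFace model id (example)
        tokenizer_path='huggyllama/llama-7b',
        max_out_len=100,                      # maximum number of generated tokens
        max_seq_len=2048,                     # maximum input sequence length
        batch_size=8,
        run_cfg=dict(num_gpus=1),             # GPUs required per inference task
    ),
]
```

Saved under `configs/`, such a file would be launched with `python run.py configs/<this_file>.py`, mirroring the command-line example above.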

🔝Back to top

+ +## 📖 Dataset Support + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ Language + + Knowledge + + Reasoning + + Examination +
+
+Word Definition + +- WiC +- SummEdits + +
+ +
+Idiom Learning + +- CHID + +
+ +
+Semantic Similarity + +- AFQMC +- BUSTM + +
+ +
+Coreference Resolution + +- CLUEWSC +- WSC +- WinoGrande + +
+ +
+Translation + +- Flores +- IWSLT2017 + +
+ +
+Multi-language Question Answering + +- TyDi-QA +- XCOPA + +
+ +
+Multi-language Summary + +- XLSum + +
+
+
+Knowledge Question Answering + +- BoolQ +- CommonSenseQA +- NaturalQuestions +- TriviaQA + +
+
+
+Textual Entailment + +- CMNLI +- OCNLI +- OCNLI_FC +- AX-b +- AX-g +- CB +- RTE +- ANLI + +
+ +
+Commonsense Reasoning + +- StoryCloze +- COPA +- ReCoRD +- HellaSwag +- PIQA +- SIQA + +
+ +
+Mathematical Reasoning + +- MATH +- GSM8K + +
+ +
+Theorem Application + +- TheoremQA +- StrategyQA +- SciBench + +
+ +
+Comprehensive Reasoning + +- BBH + +
+
+
+Junior High, High School, University, Professional Examinations + +- C-Eval +- AGIEval +- MMLU +- GAOKAO-Bench +- CMMLU +- ARC +- Xiezhi + +
+ +
+Medical Examinations + +- CMB + +
+
+ Understanding + + Long Context + + Safety + + Code +
+
+Reading Comprehension + +- C3 +- CMRC +- DRCD +- MultiRC +- RACE +- DROP +- OpenBookQA +- SQuAD2.0 + +
+ +
+Content Summary + +- CSL +- LCSTS +- XSum +- SummScreen + +
+ +
+Content Analysis + +- EPRSTMT +- LAMBADA +- TNEWS + +
+
+
+Long Context Understanding + +- LEval +- LongBench +- GovReports +- NarrativeQA +- Qasper + +
+
+
+Safety + +- CivilComments +- CrowsPairs +- CValues +- JigsawMultilingual +- TruthfulQA + +
+
+Robustness + +- AdvGLUE + +
+
+
+Code + +- HumanEval +- HumanEvalX +- MBPP +- APPS +- DS1000 + +
+
+ +

🔝Back to top

+ +## 📖 Model Support + + + + + + + + + + + + + + +
+ Open-source Models + + API Models +
+ +- InternLM +- LLaMA +- Vicuna +- Alpaca +- Baichuan +- WizardLM +- ChatGLM2 +- Falcon +- TigerBot +- Qwen +- ... + + + +- OpenAI +- Claude +- PaLM (coming soon) +- …… + +
+ +

🔝Back to top

+ ## 🔜 Roadmap - [ ] Subjective Evaluation diff --git a/README_zh-CN.md b/README_zh-CN.md index be833994..74f06505 100644 --- a/README_zh-CN.md +++ b/README_zh-CN.md @@ -34,9 +34,10 @@ ## 🚀 最新进展 +- **\[2023.09.26\]** 我们在评测榜单上更新了[Qwen](https://github.com/QwenLM/Qwen), 这是目前表现最好的开源模型之一, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥. - **\[2023.09.20\]** 我们在评测榜单上更新了[InternLM-20B](https://github.com/InternLM/InternLM), 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥. -- **\[2023.09.19\]** 我们在评测榜单上更新了WeMix-LLaMA2-70B/Phi-1.5-1.3B, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.🔥🔥🔥. -- **\[2023.09.18\]** 我们发布了[长文本评测指引](docs/zh_cn/advanced_guides/longeval.md).🔥🔥🔥. +- **\[2023.09.19\]** 我们在评测榜单上更新了WeMix-LLaMA2-70B/Phi-1.5-1.3B, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情. +- **\[2023.09.18\]** 我们发布了[长文本评测指引](docs/zh_cn/advanced_guides/longeval.md). - **\[2023.09.08\]** 我们在评测榜单上更新了Baichuan-2/Tigerbot-2/Vicuna-v1.5, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情。 - **\[2023.09.06\]** 欢迎 [**Baichuan2**](https://github.com/baichuan-inc/Baichuan2) 团队采用OpenCompass对模型进行系统评估。我们非常感谢社区在提升LLM评估的透明度和可复现性上所做的努力。 - **\[2023.09.02\]** 我们加入了[Qwen-VL](https://github.com/QwenLM/Qwen-VL)的评测支持。 @@ -53,7 +54,7 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下 - **开源可复现**:提供公平、公开、可复现的大模型评测方案 -- **全面的能力维度**:五大维度设计,提供 50+ 个数据集约 30 万题的的模型评测方案,全面评估模型能力 +- **全面的能力维度**:五大维度设计,提供 70+ 个数据集约 40 万题的的模型评测方案,全面评估模型能力 - **丰富的模型支持**:已支持 20+ HuggingFace 及 API 模型 @@ -69,245 +70,6 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下

🔝返回顶部

-## 📖 数据集支持 - - - - - - - - - - - - - - - - - - - - -
- 语言 - - 知识 - - 推理 - - 学科 - - 理解 -
-
-字词释义 - -- WiC -- SummEdits - -
- -
-成语习语 - -- CHID - -
- -
-语义相似度 - -- AFQMC -- BUSTM - -
- -
-指代消解 - -- CLUEWSC -- WSC -- WinoGrande - -
- -
-翻译 - -- Flores - -
-
-
-知识问答 - -- BoolQ -- CommonSenseQA -- NaturalQuestion -- TrivialQA - -
- -
-多语种问答 - -- TyDi-QA - -
-
-
-文本蕴含 - -- CMNLI -- OCNLI -- OCNLI_FC -- AX-b -- AX-g -- CB -- RTE - -
- -
-常识推理 - -- StoryCloze -- StoryCloze-CN(即将上线) -- COPA -- ReCoRD -- HellaSwag -- PIQA -- SIQA - -
- -
-数学推理 - -- MATH -- GSM8K - -
- -
-定理应用 - -- TheoremQA - -
- -
-代码 - -- HumanEval -- MBPP - -
- -
-综合推理 - -- BBH - -
-
-
-初中/高中/大学/职业考试 - -- GAOKAO-2023 -- CEval -- AGIEval -- MMLU -- GAOKAO-Bench -- CMMLU -- ARC - -
-
-
-阅读理解 - -- C3 -- CMRC -- DRCD -- MultiRC -- RACE - -
- -
-内容总结 - -- CSL -- LCSTS -- XSum - -
- -
-内容分析 - -- EPRSTMT -- LAMBADA -- TNEWS - -
-
- -

🔝返回顶部

- -## 📖 模型支持 - - - - - - - - - - - - - - -
- 开源模型 - - API 模型 -
- -- LLaMA -- Vicuna -- Alpaca -- Baichuan -- WizardLM -- ChatGLM-6B -- ChatGLM2-6B -- MPT -- Falcon -- TigerBot -- MOSS -- …… - - - -- OpenAI -- Claude (即将推出) -- PaLM (即将推出) -- …… - -
- ## 🛠️ 安装 下面展示了快速安装以及准备数据集的步骤。 @@ -362,6 +124,316 @@ python run.py --datasets ceval_ppl mmlu_ppl \ 更多教程请查看我们的[文档](https://opencompass.readthedocs.io/zh_CN/latest/index.html)。 +
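作为上文提到的“配置文件”评测方式的参考,下面给出一个最小化的配置示意。其中的数据集配置路径与 `huggyllama/llama-7b` 模型均为示例性假设,不同版本自带配置的目录结构可能有所不同,请以实际文档为准。

```python
# OpenCompass 评测配置的最小示意(数据集路径与模型检查点均为示例性假设)
from mmengine.config import read_base
from opencompass.models import HuggingFaceCausalLM

with read_base():
    # 复用仓库自带的数据集配置(模块路径可能随版本变化)
    from .datasets.ceval.ceval_ppl import ceval_datasets

datasets = [*ceval_datasets]

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-7b-hf',                   # 结果表中显示的模型简称
        path='huggyllama/llama-7b',           # HuggingFace 模型 id(示例)
        tokenizer_path='huggyllama/llama-7b',
        max_out_len=100,                      # 最大生成 token 数
        max_seq_len=2048,                     # 最大输入序列长度
        batch_size=8,
        run_cfg=dict(num_gpus=1),             # 每个推理任务所需 GPU 数
    ),
]
```

将其保存到 `configs/` 目录后,可通过 `python run.py configs/<该文件>.py` 启动评测,与上面的命令行示例等价。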

🔝返回顶部

+ +## 📖 数据集支持 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ 语言 + + 知识 + + 推理 + + 考试 +
+
+字词释义 + +- WiC +- SummEdits + +
+ +
+成语习语 + +- CHID + +
+ +
+语义相似度 + +- AFQMC +- BUSTM + +
+ +
+指代消解 + +- CLUEWSC +- WSC +- WinoGrande + +
+ +
+翻译 + +- Flores +- IWSLT2017 + +
+ +
+多语种问答 + +- TyDi-QA +- XCOPA + +
+ +
+多语种总结 + +- XLSum + +
+
+
+知识问答 + +- BoolQ +- CommonSenseQA +- NaturalQuestions +- TriviaQA + +
+
+
+文本蕴含 + +- CMNLI +- OCNLI +- OCNLI_FC +- AX-b +- AX-g +- CB +- RTE +- ANLI + +
+ +
+常识推理 + +- StoryCloze +- COPA +- ReCoRD +- HellaSwag +- PIQA +- SIQA + +
+ +
+数学推理 + +- MATH +- GSM8K + +
+ +
+定理应用 + +- TheoremQA +- StrategyQA +- SciBench + +
+ +
+综合推理 + +- BBH + +
+
+
+初中/高中/大学/职业考试 + +- C-Eval +- AGIEval +- MMLU +- GAOKAO-Bench +- CMMLU +- ARC +- Xiezhi + +
+ +
+医学考试 + +- CMB + +
+
+ 理解 + + 长文本 + + 安全 + + 代码 +
+
+阅读理解 + +- C3 +- CMRC +- DRCD +- MultiRC +- RACE +- DROP +- OpenBookQA +- SQuAD2.0 + +
+ +
+内容总结 + +- CSL +- LCSTS +- XSum +- SummScreen + +
+ +
+内容分析 + +- EPRSTMT +- LAMBADA +- TNEWS + +
+
+
+长文本理解 + +- LEval +- LongBench +- GovReports +- NarrativeQA +- Qasper + +
+
+
+安全 + +- CivilComments +- CrowsPairs +- CValues +- JigsawMultilingual +- TruthfulQA + +
+
+健壮性 + +- AdvGLUE + +
+
+
+代码 + +- HumanEval +- HumanEvalX +- MBPP +- APPS +- DS1000 + +
+
+ +

🔝返回顶部

+ +## 📖 模型支持 + + + + + + + + + + + + + + +
+ 开源模型 + + API 模型 +
+ +- InternLM +- LLaMA +- Vicuna +- Alpaca +- Baichuan +- WizardLM +- ChatGLM2 +- Falcon +- TigerBot +- Qwen +- …… + + + +- OpenAI +- Claude +- PaLM (即将推出) +- …… + +
+ +

🔝返回顶部

+ ## 🔜 路线图 - [ ] 主观评测